SAP's Data Anonymization Challenge

Owner: SAP
Tags: Data Anonymisation, Data Analytics, Machine Learning, AWS, Deep Learning, Ubuntu, Identification
Audience: Students, Researchers, Start-Ups, Businesses, Subject Matter Experts, Other
Anticipated Funding: $30,000
Applications close: December 20, 2019
Last Updated: September 27, 2019 5:52 AM

The Challenge

Access to data is a major differentiator for businesses in today's global marketplace and can be instrumental in breaking down data silos, extracting business intelligence through machine learning, and using AI-driven insights to deliver better experiences. However, assuring data privacy is a critical consideration in pursuing these opportunities.

Semi-structured text documents, such as invoices, sales orders, and payment advices, are an essential part of many business processes. Translating this semi-structured data into structured data is necessary to allow further downstream processing and automation.

To foster research and development of machine learning approaches for document processing, it is necessary to allow researchers to work with large amounts of realistic documents. To comply with data protection regulation, companies have to anonymize documents and remove any personally identifying information. This redaction should be done in a way that produces realistic-looking documents and minimizes negative impact on machine learning model training.

Openness is a key principle for SAP as it is the foundation for co-innovation and integration. We are embracing open standards and open source, and are providing rapid access to data and business processes through open APIs, so customers and partners can turn data into value as easily as possible. In return, it is crucial for SAP to use the power and speed of communities to innovate even faster. In this spirit, the winning solutions from the challenge will be open-sourced. Openness creates more value for everybody.

In this challenge, you will work with a set of 25,000 invoices from the public RVL-CDIP dataset [1,2]. Some of the invoices are low-quality scans or contain handwritten notes. You are also welcome to use other datasets available to you to train your model and maximize its generalizability.

Your tasks are as follows:

  1. Build a model that can identify the bounding boxes of the following types of personally identifying information:
    1. Personal names,
    2. Personal addresses (i.e., addresses that do not contain a business name),
    3. Phone numbers,
    4. Signatures,
    5. Handwritten notes, since we cannot rule out that they contain personally identifiable information.
  2. Develop a system to redact the content of the bounding boxes with a realistic replacement such that the anonymized data remains effective training data for machine learning tasks. This will require efforts to preserve the style, orientation, imperfections and complexities of the original data. Anonymized substitute text that is simpler and clearer to read than the original data will result in a less comprehensive and less effective training data set.
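
As a rough illustration of the first step of task 2, the sketch below blanks a bounding box in a grayscale page by filling it with the median intensity of the pixels bordering the box, so the blanked area blends with the local background instead of appearing as a solid bar. It rests on several assumptions that are not part of the challenge description: the page is a list of rows of 0-255 integers, boxes are (left, top, right, bottom) in pixels with exclusive right and bottom edges, and `fill_box_with_background` is a hypothetical helper name. Rendering realistic replacement text into the blanked region, which the task also requires, is not shown.

```python
from statistics import median


def fill_box_with_background(page, box):
    """Return a copy of `page` with `box` filled by the local background level.

    page: list of rows, each a list of grayscale ints (0-255) (assumed format).
    box:  (left, top, right, bottom), right/bottom exclusive (assumed convention).
    """
    height, width = len(page), len(page[0])
    left, top, right, bottom = box
    # Clamp the box to the page so out-of-range boxes do not raise.
    left, top = max(0, left), max(0, top)
    right, bottom = min(width, right), min(height, bottom)

    # Sample the one-pixel ring just outside the box as the background estimate.
    ring = []
    for y in range(max(0, top - 1), min(height, bottom + 1)):
        for x in range(max(0, left - 1), min(width, right + 1)):
            inside = top <= y < bottom and left <= x < right
            if not inside:
                ring.append(page[y][x])
    # Fall back to white when the box leaves no surrounding pixels.
    fill = int(median(ring)) if ring else 255

    # Work on a copy so the original page is left untouched.
    redacted = [row[:] for row in page]
    for y in range(top, bottom):
        for x in range(left, right):
            redacted[y][x] = fill
    return redacted
```

A flat fill like this removes the content but also removes scan noise and texture inside the box, so on its own it would likely produce the overly clean output that the task warns against.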

As ground truth for training your models, you will be provided with a sample of bounding boxes for the personally identifying information specified above.
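
The challenge text does not say how predicted boxes will be scored against the ground truth. One common choice for comparing bounding boxes is intersection over union (IoU), sketched below purely as an assumption. The box format (x1, y1, x2, y2), the 0.5 threshold, and the helper names `iou` and `matched_fraction` are all illustrative and are not taken from the challenge.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, zero if the boxes do not intersect.
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def matched_fraction(truth_boxes, predicted_boxes, threshold=0.5):
    """Fraction of ground-truth boxes covered by some prediction at >= threshold IoU."""
    if not truth_boxes:
        return 1.0
    hits = sum(
        1
        for truth in truth_boxes
        if any(iou(truth, pred) >= threshold for pred in predicted_boxes)
    )
    return hits / len(truth_boxes)
```

For anonymization, the fraction of ground-truth boxes that are found matters more than usual, since every missed box is a potential leak of personal information.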


Eligibility Criteria

For complete details on the eligibility criteria that govern SAP's Data Anonymization Challenge, please refer to the website.

Terms and Conditions

For complete details on the terms and conditions that govern SAP's Data Anonymization Challenge, please refer to the website.

How to Apply

To submit an application to SAP's Data Anonymization Challenge, please refer to the website.
