Wendy Gonzalez wants to reduce inequality. In 2015, she left the software-as-a-service startup she co-founded and joined nonprofit data labeling service Sama to do just that.
Gonzalez is now the CEO of the company, which is driven by a mission to expand opportunities for underserved individuals through the digital economy by offering work in Uganda and Kenya. Sama labels images to create training data for AI models, and it has hired more than 12,800 people from marginalized backgrounds since its start in 2008.
That mission is paying off. Customers of Sama include Google, Microsoft, Walmart, Glassdoor, and Getty Images.
Based in San Francisco, Sama relaunched in 2018 as a venture-backed for-profit, with the nonprofit as a majority shareholder. The next year, it raised $14.8 million in Series A funding, and in 2020 it became one of the first AI companies to earn B Corporation certification.
Sama is a member of NVIDIA Inception, a program designed to nurture startups revolutionizing industries with advancements in AI and data science.
Platform for Data Scientists
Sama offers a market-leading data annotation platform accelerated by NVIDIA GPUs. Its customers also get an opportunity to partner with a socially responsible business and work with underserved communities.
The startup has quickly grown to meet demand. Its year-over-year revenue has been tripling, according to the company. Its corporate office has grown from 45 employees two years ago to more than 200 today, along with some 4,000 employees with benefits in East Africa.
“We do very specific hiring practices in underserved communities,” said Gonzalez. “Because at the end of the day, business can be a force for social good.”
Sama’s nonprofit affiliate — the Leila Janah Foundation — further supports underserved communities through the Give Work Challenge, a program that supports new and early-stage businesses in East Africa with funding and mentorship.
Data Annotation at Scale
Sama’s proprietary machine learning platform combined with its “human in the loop” data annotation experts offers full service, from preparing data to dedicated account team support.
The company says its assisted annotation platform with ethical human validation offers data accuracy ranging from 95 to 99 percent, outperforming competitors.
Sama relies on NVIDIA V100 Tensor Core GPUs for training and NVIDIA T4 Tensor Core GPUs for inference.
Using the NVIDIA TAO Toolkit for transfer learning, Sama found in preliminary testing that it could achieve as much as a sixfold improvement in efficiency when labeling datasets. NVIDIA TAO compresses development time by enabling developers, even those with limited technical expertise, to fine-tune high-quality pretrained models from NVIDIA with only a fraction of the data needed to train from scratch.
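TAO itself is driven through configuration files and a command-line interface, but the transfer-learning idea behind that efficiency gain can be sketched in plain Python. In this illustrative example (all weights, sizes and data are hypothetical, not Sama's or TAO's actual code), a "pretrained" feature extractor stays frozen while only a small classification head is trained on a modest labeled set:

```python
import numpy as np

rng = np.random.default_rng(0)

# "Pretrained" feature extractor: in practice these weights come from a
# model trained on a large dataset; here a fixed random projection stands in.
W_backbone = rng.normal(scale=0.25, size=(16, 8))

def extract_features(x):
    # Frozen backbone: these weights are never updated during fine-tuning.
    return np.tanh(x @ W_backbone)

# Small labeled dataset: a fraction of what training from scratch would need.
X = rng.normal(size=(40, 16))
y = (X[:, 0] > 0).astype(float)  # toy binary labels

# Trainable head: a single logistic-regression layer on top of the backbone.
w_head = np.zeros(8)
b_head = 0.0

def predict(x, w, b):
    z = extract_features(x) @ w + b
    return 1.0 / (1.0 + np.exp(-z))  # sigmoid

# Fine-tune only the head with plain gradient descent on the log loss.
lr = 0.5
for _ in range(200):
    p = predict(X, w_head, b_head)
    grad = p - y  # d(loss)/d(logit) for logistic loss
    w_head -= lr * extract_features(X).T @ grad / len(X)
    b_head -= lr * grad.mean()

acc = ((predict(X, w_head, b_head) > 0.5) == y).mean()
print(f"training accuracy after fine-tuning the head: {acc:.2f}")
```

Analogously, TAO starts from NVIDIA's pretrained models and updates only what the target task requires, which is why far less data and compute are needed than when training from scratch.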
“The real magic of TAO was that non-ML engineers were able to build a model without involving the engineering and research teams,” said Gonzalez.
Sama says its platform already offers higher accuracy at half the cost and twice the speed compared with top competitors.
“On top of market-leading accuracy and speed in delivering vast quantities of annotated data for Fortune 500 customers, Sama’s hiring model provides an unmatched 4 percent attrition rate and business continuity in terms of account team support for our customers,” said Gonzalez.
Watching AI for Data Bias
In addition to the social benefits of helping reduce inequality through its hiring practices, Sama aims to address data bias. Diverse datasets help ensure that AI models are trained so their features work for everyone, and they help mitigate risks for the companies deploying them.
“We’ve done external audits to prove the efficacy of the models,” said Gonzalez.
Bias can show up in datasets because those assembling them may lack a diversity of viewpoints on the subjects, Gonzalez points out. Race, gender and age bias are just some examples; assembling a diverse data team, like Sama's, is one way to counter them.
“But bias can also be against motorcycles, like not having a representation of them in a transportation dataset,” she said.
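Underrepresentation of that kind can often be caught with a simple audit of the label distribution before training. A minimal sketch (the class names and 5 percent threshold are illustrative, not from Sama):

```python
from collections import Counter

def underrepresented_classes(labels, min_share=0.05):
    """Return classes whose share of the dataset falls below min_share."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items() if n / total < min_share}

# Toy transportation dataset: motorcycles are nearly absent.
labels = ["car"] * 70 + ["truck"] * 20 + ["bus"] * 9 + ["motorcycle"] * 1

flagged = underrepresented_classes(labels)
print(flagged)  # motorcycles appear in only 1 percent of samples
```

A model trained on such a dataset would rarely see motorcycles, so an audit like this flags the class before the imbalance is baked into the model.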