By Rebecca Bilbro
Lead Data Scientist, ByteCubed
Not so long ago, ByteCubed was a small D.C. startup founded on the idea that the contracting space was in need of major refactoring. As many in the D.C. area know, contracts tend to be more expensive and less flexible than necessary. The developers supporting these contracts work in the dark, many degrees removed from the feedback they need, and clients often wait years to see any fruits of that labor.
ByteCubed is designed to be different. While it’s slightly bigger and more established now, we still believe contracts can be lean, agile, focused, innovative, and profitable enough to keep the lights on.
What We Do Here
At ByteCubed, we specialize in custom software to support real-time decision-making. We take a practical approach to problem solving, leveraging data science as another useful means of gaining insight into a given problem space. Regardless of what that problem space is – educational and social disparity, corporate entity relationships, grant processing and management – we wrangle data, process natural language, and incorporate machine learning to create products that can help clients make better business decisions.
We work with our clients to identify opportunities where software can augment manual processes, provide a richer set of solutions, and enable users to understand the past and anticipate the future. We leverage open source tools like Python, Java, Spark, PostgreSQL, and MongoDB because they not only save our clients development time, but are transparent, community-tested and -approved tools that are fundamentally safer to use.
What “Data Science” Means to Us
We believe data science isn’t magic. It’s a principled practice. At ByteCubed, our data science team is most heavily focused on data wrangling, natural language processing, and machine learning. We get to work with a lot of different kinds of data: unstructured text from the wild web, relational datasets from commercial providers, semi-structured data from public APIs. But on a practical level, whether we’re working with social media data or corporate entity data, we’re looking for ways to make it easier for analysts to do what they already do – apply their domain expertise to look up data, draw connections, anticipate shifts and changes, make hard business decisions – but do it faster, easier, smarter, and in a more scalable way.
What We’re Really Good At
Data science is a big space. At ByteCubed, our focus is mainly on natural language processing, recommender systems, unsupervised learning, and graph-based models.
Natural language processing
From newspaper articles and speeches to informal conversations on social media, natural language is one of the richest and most underutilized sources of data. Not only does it come in a constant stream, always changing and adapting with the context, it also encodes information about our relationships, beliefs, intentions, and motivations that cannot be conveyed through traditional data sources. Natural language processing is the work of translating human speech and writing into machine-readable content to generate analytics and insight – tasks such as stemming and lemmatization, part-of-speech tagging, and vectorization.
Once raw text is transformed into numeric feature vectors, we can leverage it to perform all kinds of valuable analytics, from extracting keyphrases for better text searching to sentiment classifiers that ascribe relative positivity and negativity scores to text.
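As a rough illustration of the vectorization step described above, here is a minimal sketch of TF-IDF weighting written with only the Python standard library. The function names and the toy corpus are hypothetical, chosen for this example, and the weighting scheme (term frequency times the log of inverse document frequency) is one common variant rather than a description of any particular production pipeline.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document using a basic TF-IDF scheme."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero for empty documents
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical toy corpus, for illustration only.
docs = [
    "The grant application was approved",
    "The grant application was denied",
    "The company acquired a startup",
]
vectors = tfidf_vectors(docs)
```

Under this scheme, a word that appears in every document (here, "the") gets a weight of zero, while a word unique to one document (such as "approved") is weighted most heavily, which is what makes the resulting vectors useful for search and classification.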
At ByteCubed, we often work closely with domain experts to identify places where our team can make everyday tasks more efficient, scalable, and objective. If an administrative assistant must manually review many requests and decide how to route them, we can train a model using historic records to predict the appropriate audience for a request based on its content. For the analyst who must scan through thousands of news articles each week to find the most relevant stories, we can build a recommendation system that performs content-based and cognitive filtering, sifting through many more articles than any person could read, and highlighting the most important ones.
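To make the content-based side of such a recommender concrete, the sketch below ranks articles by cosine similarity between each article's word counts and a short description of an analyst's interests. This is an assumption-laden simplification: the helper names, the interest profile, and the sample headlines are all invented for illustration, and a real system would likely use richer features than raw word counts.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Count lowercase alphabetic tokens in the text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_articles(profile_text, articles):
    """Return (score, article) pairs sorted from most to least similar."""
    profile = term_counts(profile_text)
    scored = [(cosine_similarity(profile, term_counts(a)), a) for a in articles]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical interest profile and headlines, for illustration only.
profile = "federal grant funding for education research"
articles = [
    "New federal grant funding announced for education programs",
    "Local team wins the championship game",
    "Research universities compete for grant awards",
]
ranked = rank_articles(profile, articles)
```

Here the funding headline ranks first and the sports headline scores zero because it shares no vocabulary with the profile. The same scoring loop scales to far more articles than a person could read in a week.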
Unsupervised machine learning
Most of the data that comes through our office isn’t labeled, meaning it’s not well-suited for traditional classification methods. As a result, we frequently employ unsupervised machine learning techniques to discover patterns and themes. Sorting through an enormous pile of unlabeled text documents might seem like a Sisyphean task for one person, but our data science team can use latent topic extraction to find hidden themes within and across many documents, or hierarchical clustering to discover natural groupings even without a priori categories.
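As one small example of the hierarchical clustering mentioned above, the following sketch groups unlabeled documents bottom-up: every document starts in its own cluster, and the two closest clusters are merged repeatedly. The choices here are assumptions made for brevity (Jaccard distance on word sets, single linkage, a fixed target number of clusters), and the documents are made up.

```python
import re
from itertools import combinations


def token_set(text):
    """Return the set of lowercase alphabetic tokens in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard_distance(a, b):
    """1 minus the share of tokens the two sets have in common."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def agglomerative_clusters(documents, n_clusters):
    """Single-linkage agglomerative clustering; returns lists of document indices."""
    sets = [token_set(d) for d in documents]
    clusters = [[i] for i in range(len(documents))]

    def linkage(c1, c2):
        # Single linkage: distance between the closest pair of members.
        return min(jaccard_distance(sets[i], sets[j]) for i in c1 for j in c2)

    while len(clusters) > max(n_clusters, 1):
        # Find the two closest clusters and merge the second into the first.
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]


# Hypothetical unlabeled documents, for illustration only.
docs = [
    "federal grant funding for schools",
    "new grant funding for rural schools",
    "company merger creates shipping giant",
    "shipping company announces merger plans",
]
groups = agglomerative_clusters(docs, n_clusters=2)
```

No categories were supplied in advance, yet the two funding documents end up together and the two merger documents end up together, purely from shared vocabulary.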
One of the most challenging tasks – whether with structured or unstructured data – is effectively modeling interconnections between entities (e.g. people, organizations, companies, products, and places). For these problems, we often leverage network graphs, since they provide a data structure that naturally models information in terms of nodes connected by edges. By extracting graphs from data, we can immediately leverage the features of that structure – from looking at the overall structure to identify social hubs or weak spots, to using computational methods like centrality to find key players, to using neighborhood blocking to perform entity resolution more efficiently.
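The sketch below illustrates two of the graph ideas above on a tiny, invented network of people and organizations. It computes degree centrality (the fraction of other nodes each node touches) and then applies one possible reading of neighborhood blocking, which is an assumption on our part: only entities that share a neighbor are proposed as candidate pairs for entity resolution, instead of comparing every entity with every other. All names and edges are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(edges):
    """Build an undirected adjacency map {node: set of neighbors}."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def degree_centrality(graph):
    """Fraction of the other nodes that each node is directly connected to."""
    n = len(graph)
    if n < 2:
        return {node: 0.0 for node in graph}
    return {node: len(neighbors) / (n - 1) for node, neighbors in graph.items()}


def candidate_pairs(graph):
    """Blocking: propose only pairs of nodes that share at least one neighbor."""
    pairs = set()
    for neighbors in graph.values():
        # Every pair inside one node's neighborhood shares that node.
        for a, b in combinations(sorted(neighbors), 2):
            pairs.add((a, b))
    return pairs


# Hypothetical person-to-organization edges, for illustration only.
edges = [
    ("J. Smith", "Acme Corp"),
    ("John Smith", "Acme Corp"),
    ("Ana Lee", "Acme Corp"),
    ("Ana Lee", "Globex"),
    ("Raj Patel", "Globex"),
]
graph = build_graph(edges)
centrality = degree_centrality(graph)
key_player = max(centrality, key=centrality.get)
pairs = candidate_pairs(graph)
```

In this toy graph, "Acme Corp" is the most central node, and blocking proposes "J. Smith" and "John Smith" as a candidate match because both connect to it. Only 5 of the 15 possible pairs need a detailed comparison, and that saving grows quickly as the graph gets larger.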
End-to-End Data Science
At ByteCubed, we see data science not as isolated or discrete analytic tasks, but as a complete, end-to-end pipeline that includes data ingestion, wrangling, analysis, modeling, and application. We build data products designed to take in massive data sets and return richer, more comprehensive data from which valuable insights can be extracted.
For this reason, we believe that data scientists make the most impact when they’re part of the engineering team. That means we’re responsible for producing high-quality, readable code, using agile development, version control, and continuous integration, writing our own unit tests, and upholding configuration management protocols.
Our goal is to inform, empower, enhance, and enrich the products our tech team develops for our customers. While the tools we wield may seem complex and specialized, ultimately, we strive to make our processes intuitive, transparent, and repeatable, to ensure their value is always clear.