Lessons learned and key takeaways from implementing DataHub

JoshuaG
3 min read · Oct 31, 2023
Photo by Ross Sneddon on Unsplash

Perhaps it is better to describe my open-source DataHub journey backwards.

Today our organization has a reliable open-source data catalog with various customizations to accommodate Data Governance intake. Data analysts are able to understand the impact of a change before it’s deployed to production, and anyone in the org is able to quickly search through the many business glossaries within the platform to find a reliable definition that makes sense to their particular line of business.

From a technical perspective, our current implementation of DataHub is robust enough to facilitate new feature previews, sustain a healthy deployment workflow and allow end users to validate core changes to the platform without interruption to our production system. I am fortunate enough to work with a small group of very capable engineers who are critical to the maintenance and customization of this platform. None of this would be possible without the foresight and understanding of my senior leadership.

Now that we’ve covered all the glitz and glamour, let’s dive into the challenges we’ve faced to get us to where we are now.

Data Governance

— Why are we doing this?

As much as I would like to geek out on the specific benefits of AWS EKS or Helm chart deployments, a data analyst will not care about these implementation details. Today’s data stewards rely on Data Governance (DG) teams for specific answers regarding impact analysis of upstream view changes, lineage graphs or just a simple business glossary. If we focus on the technology first and forget the benefits of an enterprise data catalog, we could lose sight of the overarching needs of our organization to conduct day-to-day business.

A core capability of DataHub is to serve as an enterprise data catalog by leveraging well established metadata management principles. Data teams are further empowered to deliver value to the organization by promoting a better understanding of their particular data sets. Data engineers are given the tools to get a handle on impact analysis of changing views, transformations and models.
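To make "impact analysis" concrete, here is a minimal sketch of the idea in plain Python. It is not DataHub's API or its implementation. The dataset names and the `DOWNSTREAM` mapping are hypothetical, and the helper `impacted_by` is only an illustration of walking a lineage graph to find everything that depends on a changed view.

```python
from collections import deque

# Hypothetical lineage: each dataset maps to the datasets that read from it.
DOWNSTREAM = {
    "raw.orders": ["staging.orders_view"],
    "staging.orders_view": ["analytics.daily_revenue", "analytics.customer_ltv"],
    "analytics.daily_revenue": ["dashboard.finance_overview"],
    "analytics.customer_ltv": [],
    "dashboard.finance_overview": [],
}


def impacted_by(changed, downstream):
    """Return every dataset downstream of `changed`, in breadth-first order."""
    seen = {changed}  # guards against revisiting nodes if the graph has cycles
    order = []
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for child in downstream.get(current, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


print(impacted_by("staging.orders_view", DOWNSTREAM))
```

In this toy example, changing `staging.orders_view` touches two analytics models and one dashboard. A data catalog answers the same question from real lineage metadata, so the analyst sees it before the change reaches production.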

DataHub aims to serve a fundamental role: it applies metadata management principles, enables data observability and discovery, and serves as a core component of federated data governance. These initiatives are complex, and the effort involved should not be underestimated. Here are some other questions which may help in planning: How does DataHub support the Data Governance program? Would it make sense to scale down the deployment to a more consumable size? How will we train users in the tool? What about deployment office hours to facilitate the transition?

Engineering

— How to overcome roadblocks

There will be many roadblocks on your path to productionizing DataHub, and a solid understanding of the core components will help in planning. The following should be carefully planned:
1. Cloud platform (AWS, GCP, Azure)
2. Kubernetes cluster configuration (Choosing a managed service on AWS helps to reduce complexity of deployments)
3. Application specific behavior (Pods, Ingress, SSL, Authorization, User Provisioning)
4. Management of application and underlying components (CI/CD integration, Helm chart deployments, upgrade timeline, user provisioning)
5. Scope for the deployment (Initial data teams being targeted, transition details, engineering support)

Teamwork

— You can’t go it alone

Going it alone, whether on the application itself or on integration with existing infrastructure, will greatly delay deployment timelines. Sharing the core deployment knowledge across a team will help reduce bottlenecks. Working with existing IT teams will be an absolute requirement to ensure your final environment is properly secured, performs well, and is properly supported. Start those conversations and planning as early as possible.

Conclusion

— Follow the plan, and if there isn’t one, create one.

As noted earlier, deploying a new data governance tool is a complex endeavor, and a smooth transition, high adoption rates and overall effectiveness will hinge on various factors. Keep the overall requirements for the tool in mind, stick to the overall DG program, and emphasize the benefits to the business rather than the technology.
