Brainforge Final Interview
Date: March 12, 2026
Source: Granola
Meeting ID: a877e8ec-fcea-4561-af50-20cfb6946b97
URL: https://notes.granola.ai/t/a877e8ec-fcea-4561-af50-20cfb6946b97
Participants:
- Uttam Kumaran (note creator) from Brainforge uttam@brainforge.ai
- Awaish Kumar from Brainforge awaish.kumar@brainforge.ai
- Gilbert Adjei gilbertadjei800@gmail.com
- Demilade Agboola from Brainforge demilade.agboola@brainforge.ai
Technical Solution Walkthrough
- End-to-end data pipeline demonstration using production-style implementation
- Data ingestion via Airbyte from JSON exports to Postgres
- Transformations using dbt with staging, intermediate, and mart layers
- Full CI/CD automation with GitHub Actions
- Dockerized environment for reproducibility
- Airbyte setup using abctl (Kubernetes in Docker via kind)
- Sources: JSON files → Destination: Postgres in Docker
- pgAdmin on port 5050 for graphical database queries
- dbt architecture rationale (see the staging sketch below):
- Raw schema: Immutable data from sources (airbyte user access)
- Staging: Light transformations, casting, renaming (materialized as views)
- Intermediate: JSON expansion and nested data handling
- Mart: Business metrics and analytics (materialized as views/tables based on BI tool usage)
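A minimal sketch of what a staging model under this rationale could look like, assuming hypothetical raw table and column names (not taken from the actual repo): casting and renaming only, nested JSON left intact, materialized as a view.

    -- models/staging/stg_orders.sql (hypothetical model name)
    {{ config(materialized='view') }}

    select
        cast("ID" as bigint)            as order_id,        -- rename + cast the upper-case raw column
        cast("CREATED_AT" as timestamp) as created_at,
        lower("FINANCIAL_STATUS")       as financial_status,
        "LINE_ITEMS"::jsonb             as line_items       -- nested JSON stays jsonb; expanded in the intermediate layer
    from {{ source('raw', 'orders') }}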
- Role-based access control (RBAC) implementation (see the role sketch below):
- Finance role → finance-related data only
- Operations role → operations-related data only
- “NOLOGIN” keyword creates roles (vs. login users) in Postgres
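A minimal Postgres sketch of the role pattern described above, with hypothetical role, schema, and table names: NOLOGIN group roles carry the grants, login users inherit them, and the raw schema stays read-only for dbt.

    -- Department group roles: NOLOGIN marks them as roles rather than login users
    CREATE ROLE finance NOLOGIN;
    CREATE ROLE operations NOLOGIN;

    -- Each department role can read only its own mart objects (hypothetical names)
    GRANT USAGE ON SCHEMA mart TO finance, operations;
    GRANT SELECT ON mart.finance_summary TO finance;
    GRANT SELECT ON mart.operations_summary TO operations;

    -- A login user inherits whatever the granted role is allowed to see
    CREATE ROLE jane_doe LOGIN PASSWORD 'change_me';
    GRANT finance TO jane_doe;

    -- dbt reads from raw but cannot modify it; only the Airbyte user writes to raw
    GRANT USAGE ON SCHEMA raw TO dbt_user;
    GRANT SELECT ON ALL TABLES IN SCHEMA raw TO dbt_user;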
Performance & Optimization Discussion
- Model performance troubleshooting approach for slow queries (10-15 seconds → 56 minutes; see the diagnostic sketch below):
- Analyze query efficiency and unnecessary joins
- Check materialized data for proper indexing
- Investigate upstream model changes (expected 1K records → receiving 1M+)
- Review query structure and optimization opportunities
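A short diagnostic sketch along the lines above, with hypothetical model and column names: first check whether an upstream model suddenly grew, then inspect the plan of the slow query for unnecessary joins or sequential scans.

    -- Did the upstream model grow from the expected ~1K rows to millions?
    SELECT count(*) FROM staging.stg_orders;

    -- Inspect the slow model's query plan for unnecessary joins and full scans
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT o.order_id, sum(li.amount) AS total_amount
    FROM staging.stg_orders o
    JOIN staging.stg_order_line_items li ON li.order_id = o.order_id
    GROUP BY o.order_id;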
- Indexing strategy for 500M+ row tables with BI dashboard filters (see the index sketch below):
- Primary key indexes for unique row identification
- Foreign key indexes for fact-to-dimension table joins
- Date/time indexes for common analytics filtering
- Validation via “EXPLAIN ANALYZE” before/after performance comparison
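A sketch of the index strategy above on a hypothetical mart.order_summary table, validated by comparing EXPLAIN ANALYZE output before and after, and checking index usage statistics to avoid over-indexing.

    -- Baseline: capture the plan and execution time before adding indexes
    EXPLAIN ANALYZE
    SELECT * FROM mart.order_summary
    WHERE order_date >= date '2026-01-01' AND financial_status = 'paid';

    -- Indexes matching the dashboard filter columns (hypothetical names)
    CREATE INDEX idx_order_summary_order_date ON mart.order_summary (order_date);
    CREATE INDEX idx_order_summary_customer_id ON mart.order_summary (customer_id);
    CREATE INDEX idx_order_summary_fin_status ON mart.order_summary (financial_status);

    -- Re-run the same EXPLAIN ANALYZE and compare; rarely used indexes show low idx_scan here
    SELECT indexrelname, idx_scan
    FROM pg_stat_user_indexes
    WHERE relname = 'order_summary';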
- dbt materialization for large datasets (see the incremental sketch below):
- Switch to incremental materialization for 10+ minute model runs
- Use unique keys (single field or composite) for proper incremental processing
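A minimal dbt incremental sketch matching the discussion, with hypothetical model and column names; unique_key can be a single column or a composite list.

    -- models/mart/order_summary.sql (hypothetical model name)
    {{ config(
        materialized='incremental',
        unique_key='order_id'          -- or a composite, e.g. ['order_id', 'line_item_id']
    ) }}

    select
        order_id,
        order_date,
        total_amount,
        updated_at
    from {{ ref('stg_orders') }}

    {% if is_incremental() %}
      -- on incremental runs, only process rows newer than what is already in the table
      where updated_at > (select max(updated_at) from {{ this }})
    {% endif %}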
CI/CD & Production Deployment
- GitHub Actions workflow (runs every 6 hours):
- Ubuntu environment with Python dependencies
- Postgres container with secrets from GitHub repository settings
- Python script replicates Airbyte data ingestion for ephemeral database
- dbt runs the prod target with the full test suite
- Alerting integration recommendations:
- Teams/Slack integration with version control
- Proactive failure notifications with direct links to error details
- AI tool usage philosophy:
- Internal team equipped with Cursor, ChatGPT accounts
- Focus on core job functions vs memorizing syntax/commands
- Client empowerment through AI-integrated BI tools (Omni platform)
- Efficiency gains enable handling 2-3x more client projects
Transcript
Me: Hello. Them: Hi, everyone. Great. All right. So thank you all so much for joining in. I'm trying to turn on my camera. Me: Good. Them: All right. Anyway. Taking a little minute anyway, so thank you so much for making it for this walkthrough. So I'm also going to share my screen. And then please let me know if you're able to see my screen as well. Okay? Me: Yes, you can see it. Them: Awesome. Great. So this is going to be a walkthrough of an end-to-end demonstration of the take-home assignment and how it works, right from data ingestion up until the CI process, and then the decisions that governed the choices of tools and things of that sort. And we'll see the way forward. So this solution is a production-style implementation that ingests the data that was shared, which is the JSON exports, and then saves them into Postgres and transforms them with dbt. And then the whole solution is also governed clearly with CI automations and things of that sort. So the data ingestion was done using Airbyte, and then transformations with dbt, and RBAC. We'll look at it and then also finally look at the GitHub Actions, and within the readme I had also included how to set these things up. And the whole system is dockerized so that it can be reproducible in various environments. Now, talking about Airbyte, which is a data ingestion tool, what I did was to use abctl, which essentially creates Kubernetes in Docker, right? So it uses kind to create a Kubernetes cluster in Docker and then ensures that all the different parts within Airbyte have been installed and then set up correctly. So referencing the documentation, it outlines how to do that. So I set it up this way. And again, the prerequisite for this project is for you to have a Docker environment running, for you to have Python, and then the rest are all within the readme. So, as you can see, this is running on port 8000. And it has the different files being set up as connections. I'll go into the code base briefly, but to demonstrate how it works, we have the sources which have these files and then the destination, which is Postgres, which is also running in Docker. Once the connection has been established between the files and then the Postgres database, which is running in the docker-compose file. So this is the whole setup of the project. So we have the docker-compose file, which has Postgres running with pgAdmin. pgAdmin in this case is for you to graphically query your database. I spun it up and it runs on port 5050. So this is how the various services interact in the docker-compose file now. Moving back to Airbyte. So I've shown the destination and then the sources. So once you want to move data from the source to the database — so over here, I just clicked on one source. And this process is repeatable, right? We can always reproduce the environment. So you can click on sync now and then you can see that the data will move from the source, which are the files that were shared, into the Postgres database. Now, to validate that this has indeed gone into the database — once the database is initialized, there are certain scripts that are run. So these scripts initialize the schemas and then the roles. So the four different roles that exist: the developer, the Airbyte, the dbt, and then the BI roles. And then for each of them, they have what they do, as defined within the readme. So the raw Airbyte role, it's responsible for creating or ingesting the data into the Postgres database. 
So it creates that and then the schema raw, over here, as we can see. Once it's a successful push, it moves the data from Airbyte and then creates these three tables stored under the raw schema. So this is how the raw schema, using the Airbyte user, creates the various tables within the schema. Now, when that happens, the data — we can validate that it's all here, so we can query it. As I mentioned earlier on, to graphically query your data, I spun up a Docker container of pgAdmin so that you can run various commands, like if you want to take a quick peek at the data. So let's just say I want to — so it creates the raw data and then Airbyte automatically adds these useful fields to it, right? So that if you want to perform any other things on it. But this raw data in itself comes, according to the metadata, with primary keys and things of that sort. However, once that has been ingested, the pipeline uses dbt to perform various transformations on the data. So in dbt we have the staging, which essentially casts the raw data, performs some casting on it, and then. Me: Yeah, perfect. Them: Ensures the renaming and things of that sort. Initially the data came in upper case, so the staging environment takes the raw data from the raw schema and then also performs these light transformations on them. And within dbt I added various tests to it so that the various models are properly tested and you don't have surprises when we move into production. So how we run that, I have defined within the readme. However, I can do a quick show of how to run the staging target: do dbt run, target, and then the staging, and this will run all the models within the folder. And it's good to mention that I also created macros. So a macro, it's more like a function, and it helps you to reuse code so that I don't have to be repeating myself within the dbt environment. Now, when that is done — we have a dbt user which has the ability to create tables. It has the ability to also modify the staging environment, and then also, yes. However, the data that is ingested within the raw schema, the dbt user cannot make modifications to that. It is immutable, because we want dbt to focus on the transformations, and then we want the data that comes from the various sources to also remain the same. So that's with regards to the dbt part. Also, since the data came in with JSON fields and all that, I created an intermediate stage which expands — Me: Yeah, I guess my question is, talk us through why each of the different stages and models. Them: Okay? So I added the different stages. So I have the raw and then a staging, intermediate, and a mart. It takes in the data from the source, right from the source, and then ingests into the raw schema. The staging performs light transformations like renaming the fields and then casting fields, and then I chose intermediate because the intermediate stage handles the JSON side of the data. So the data came in in the form of nested JSONs and all that, so those transformations happen within the intermediate stage. And then the mart performs the business side, the business case. So if there are metrics that need to be generated, the mart handles the analytical business cases. So that was why I separated the folder structure in this way. Yeah. My follow-up question is why we kept the JSONB, like JSON columns, as JSONB in the staging instead of splitting it into separate columns. So in the staging. 
I wanted to maintain things as they were, because within the staging, I just wanted to do minimal transformations, like casting and also renaming of fields and that sort. And then within the intermediate stage, I would rather work on the JSON field. So that was why I kept the staging stage in that way. Okay? So moving on, the tests run when you do dbt test and then staging, and then it will run the YAML files. So ensuring that the customer ID, for it to be a primary key, it should not be null, should also be unique. While you were writing these tests, did you come across any business-related tests, like instead of just the standard ones, not null, unique? Okay, so within the business side of things, I wanted to create it — I wanted it to be minimal, right? Ensuring that the grain is one row per order. I created that, for the standard ones, yes. And then I also — you can name, for example, if you have anything in mind, like what else we can create. Yes. So, for example, if we want to see fulfillment status, we want to see whether it meets a particular threshold, we can add that to the test as well. Right. So you can always include a test based on the business case, and let's say a total amount, expecting the amount not to be greater than, say, 100,000 — I can also add that expression as a form of a test in here. Okay? Yes. And then aside from these ones too, if the data were to be coming from a live source, you would want to test for freshness of the data, right? So that the data wouldn't be stale and all that. However, this particular data is just the files, and so — if it were live data, you would want to test for data freshness and ensure that your data isn't stale. Yes. So those are — some of these models are materialized as views. Is there any reason for that? Yes. So for standardization of layers, I would prefer that they are saved, or they are materialized, as views, and then for layers where data is often hit by downstream tools like the BI tools and things of that sort, those should be materialized as tables. So in situations where performance matters, you would want to save those marts and things of that sort as tables. And then for the staging part, you want to save it as views. No, no. I mean, the marts table is also materialized as a view. What do you think? Like, should it be a view or a table? Yes. So what I'm saying is that in this case, it's on a case-by-case basis. In scenarios where we have BI tools interacting more with the model, I would want it to be materialized as a table, especially when performance really matters. Okay? If we go back to that mart order summary model. Yeah. So in that, I can see it's only joining orders with order line items. But it is missing information regarding the customer information or the products information. Yes. So this is a metric that I defined. So I think within the take-home it was more like you can have metrics, you can define your metrics and then have it going into a mart. So that's a metric that I opted to define. But however, yes, if there are metrics that would need to combine all the different tables, it's something that can be joined in over here. Since we already have this, we can easily bake it in. Any other questions regarding dbt? Otherwise, I want to ask something about databases. For me — so my, my question with dbt would be — or not dbt per se, but how would you control — can you — how would you control access to a specific mart, for instance? 
So if you were trying to do some data governance and ensure that only — like, fine, so right now it's only, like, the mart order summary. But if we had finance, we had marketing data, how would you ensure that the people that saw it were the only appropriate people? Sure. So you can use tags to help you with that. And then also within the RBAC, so if you have departments and then a user is from a particular department like finance — all users inheriting from the finance role should be able to see things that are pertaining or related to finance. And then you can do a similar thing for those in operations, right. So all users within the Operations team who are inheriting roles from Operations should be able to see things pertaining to Operations only. Okay, thank you. Yeah. Can we see the roles, the RBAC file? Yes. Yeah, like — if, like, can you open somewhere where it's creating the roles? Like, what is this? No, the roles. Okay. The roles. Yes. So what does this, like, keyword NOLOGIN mean here? Okay, so what that means in Postgres is that this particular one is a role and not a user. So in another flavor you can do something like create user and then the name of the user. However, in Postgres, the syntax is such that if you want it to be more like a role, and that's how it differs from a user, you use the keyword NOLOGIN. Yes. Okay? And how do you define these variables in your initialization of the project? The ones that are being used at the bottom, like dev user password and the airbyte user password. Yes. So those are being passed in from — so if I go to — yes. So I already have how this is being run. And then, so once this is being run, I have — so on my — I'll show how things are done when it's in deployment. But on my local, you can set these variables in here, right? So that it can point to the credentials. However, in production, how I do that is I use the secrets. So let me come here, and then go to settings, and then over here — so secrets and variables, actions. And then over here — so this is how, in production, I set my secrets so that they're being ingested in there, and then I don't need to add or push my secrets to the repo. Okay, yes. Okay, I don't see, like, workflow files in the repo. Yes. So when I submitted it as part of the email response, one of the things that I realized was that — that one is Brainforge assessments. Me: Yeah. Them: Right? Yeah. So one thing I realized was that when I pushed this and then I wanted to create a PR out of it, it showed me — so let me even show you how it's looking. It told me that you cannot create a PR out of this. And then I added it as part of the email. Okay. But you can show me the actions you are running right now. Yes. So this particular one — that particular one was done in a private repo. This one, when I pushed it in this particular repo, I was only able to create a branch. I wasn't able to create a PR. It just told me that it wasn't possible. So I created a private repo, just to demonstrate the end-to-end solution, and then in that particular one I'm able to do a PR and then to show how this whole process works. Can you walk me through the YAML file for those actions? Yes. So we have that just for prod. I created two files, one for prod and then one for staging. So, as I had shown earlier, the credentials — I had shown how they are being stored in Git. And then you can access them using secrets dot and then the name of the secret. So how prod works is I have scheduled this to run every six hours. So it uses the Ubuntu base OS. 
And then I have put this into various stages, steps. So after it spins up Ubuntu, it installs Python and then it installs all the necessary dependencies for this to run. And then, as I mentioned, it extracts the secrets from the repo's secrets so that it can use them to spin up the Docker image. Right. It pulls a Docker container, a Postgres Docker container, and then sets the credentials using that, from Git. And once that is done, we give it some minutes to fully initialize and create the necessary schema that is needed. And then also, since I was creating the workflow in GitHub, we needed to ingest the raw data, right? So we needed to make it similar to how Airbyte was running locally, and within this take-home, we had to set up Airbyte locally and then also ingest the data. So I wrote a Python script that picks the data and then ingests it into this ephemeral database that has been created. And then, since dbt had already been installed, it initializes the working directory, which is the same as the one that has been on the project, and then installs all the dependencies that are needed and sets the various targets and then the various environment variables. And when that is done, it runs the prod target and then also tests it. And in order to see whether it's working or not, you go to the workflow over here. So once any PR is pushed or — yes, so I intentionally made it fail and also made it pass so that we see how things work. So to see it in action — again, once you push, it will automatically trigger it. And then when you go to usage, you see how it worked — all the different steps. So setting up the job, the dependencies, the Postgres container, the schema dependencies and all that. So when you click into run dbt — yes. So, dbt run, target prod. And then the secrets have been hashed out, so that we don't have any form of leakage, and then the system is secure from any form of attacks. And then when all this is done, it indicates that it has passed, and then the various models that need to be run have been run. That's how it runs. And it also executes the tests as well. That's how it runs in production. Also, if we're going to set up alerting — how would you set up alerting? Yes. In fact, it's one of the things that maybe as a next step would be great to do. So typically what I do when it comes to alerting is, if we are using Teams, I integrate Teams with the version control so that if it fails, automatically it will send a Teams alert. So if it is Slack, do that integration with version control, and then when it fails, you can proactively click on it and see where the failures are happening and then fix them. So that's how I will do it. Yeah. So what I see, like, we use Airbyte because we want to ingest using Airbyte instead of writing custom scripts. Yes. But in the action I can still see the step to load the data. What is that? Yes. So this whole process is — so if it's being dockerized, it needs the raw data to work on, for dbt to work on it. Now, for the Git action to work on it, it means that this environment needs access to the data. And since the Postgres has not been deployed to a live server, we cannot access the data from there. Right. And still we need data somehow to be in Postgres before dbt can run on it. So that's why we spun up the Postgres database, and then we needed to ingest the data into Postgres before we run dbt on it. Because Postgres hasn't been deployed to a live server. Yes. Okay? 
Me: I guess my question is going to be like, let's say you notice that one of the models goes from taking, like, 10, 15 seconds to, like, 56 minutes. Walk me through some of the investigation that you commonly think you would do, to sort of see what's going on. Them: Yes. So the first thing that I'll do is to look at the model that is failing and then also look at the queries that are running it. So if there are a lot of unnecessary joins, then it means that there are some inefficiencies in there which need to be worked on. And I would also look at the data and the model, the query itself — once it has run, whether the materialized data has some form of indexes on it or not. If it doesn't, then it means that probably things are not being done well. I'll look at the upstream models that are being run and see if there are some new changes that are coming in and that is causing the downstream model to delay unnecessarily. So for example, if I'm expecting a thousand records from an upstream model and then it enters, say, 1 million or 10 million, I know that maybe something may have changed upstream. So I'll also investigate further to see whether things have changed upstream. But most importantly, it's about the efficiency of the queries that have been written within that — that build that particular model. Let's build on top of that, since you mentioned you could add indexes, so let's take an example. Like, we have the mart order summary table and it grows to, like, 500 million plus rows and our queries really become very slow. And then we have a downstream dashboard, which has, like, visuals, filters on order date, customer ID and financial status. So, like, what exactly — what indexes would you consider first on that table, and why? So if I understand you: the indexes that can be done on the mart data, the mart order summary, that is being used by a BI tool, which basically uses filters on order date, customer ID and financial status. And the query is really slow, so it takes a lot of time to load on the BI tool. So what indexes would you consider first, and why would you consider that? Okay? So first of all, I'll consider the primary key index, and then also it ensures that the row is unique and then makes the lookup very fast. And then I also look at the foreign key index, right? So if there are fact tables that often reference, say, dimension tables, these forms of indexing can really speed up the joins. And then also, since there may be some date-time within the data, I would also look at — it's very common within analytics, in fact, since you may be filtering by date ranges, things like that — I also create indexes on that as well. Okay, and then how would you validate index usefulness and avoid over-indexing? How do you validate — I'm asking for you to repeat. Me: Oh, yeah? How would you validate that the indexes worked, and that you didn't over-index, add too many or. Them: Yeah. So normally when you add an index, you would want to run a query like EXPLAIN ANALYZE before and after. So if it's taking a long time, let's say you do EXPLAIN ANALYZE and you do select star from something, a table, where customer ID is 100, and then you see how it runs, and then you compare the execution time, right? So if the first one takes like one minute, after applying the index you run the same, and then you see that the time has dropped significantly, and then you see that yes, the indexing worked. And also you can use the database's statistics as well to check, normally. 
I'd use EXPLAIN ANALYZE to confirm that the time before and then the time after has a lot of difference. Have you ever worked with Redshift? Yes, I have, but it was a long time ago. So let me ask you that question — because I just wanted to see, like, your understanding of sort keys, but that's fine. Yeah. Since — since this model now has grown to 500 million plus rows, it is possible that if I create this table, it is going to take like maybe 10 minutes to run in dbt. So what, like, dbt materialization strategy would you change to cater for that? Can you please take that again? So, for example, we have a table, the mart order summary. Because now there's a lot of data, it takes, like, a lot of time to run or execute it. Like maybe, for example, you can say it takes 10 minutes to execute one model. So what can I change in my dbt materialization strategy so I can make it faster? Yeah. So I think — I think that if it's slowing down and then we want to make it very fast in dbt, it comes — first of all, make it incremental, off the top of my head. Since — yeah, you can make it incremental, and then when the model is very large, and if there are new or changed rows that need to be processed, using incremental can really help you with things that are very large like this. And if you're going to do incremental, how would you build it? Would you like — how do you ensure that you are incrementing the data properly? Sure. So normally I would add something like this. If you are still seeing my screen, I'll just do something like this: config, and then define its materialization — materialized, incremental — and then the unique key, I'll just use something like the order ID. So something like this. Quick question, just to follow up on that: if it doesn't have a unique key, what would you use to increment? If it doesn't have a unique key, combine, say, two fields and then make it a unique — two or more fields which uniquely identify the row — and then use that as a key. Me: Cool. I know we're just coming up on time. I think this is great. I mean, I think a pretty good understanding of sort of how to set up each of the components locally. I think running through dbt was great. I'm wondering — I want to leave enough time for your questions, but I'm kind of interested in your feedback on the exercise and what part was interesting, or if any part was tough, just like your reflection. Them: Yes. So I think with the exercise it was great because it was more practical, because it's something that you would see normally if you are working with clients and things of that sort. One of the things that interestingly was a challenge for me, when I — well, what I saw was that my SSH key had for some reason been corrupted. So I had worked on it and then when I was pushing to Git, I realized that something had been corrupted, so I had to quickly use an AI tool to speed things up, right? So I didn't need to memorize Git commands to set up SSH keys, things like that, which I felt was very helpful to me. And leveraging these AI tools can help you move fast and you can focus more on the broad objective of, say, helping clients or making revenues go up and things like that, as opposed to memorizing things and then making yourself slow. So that was my challenge, and that was how I was able to overcome it with these AI tools. Me: What questions do you have for us, Gilbert? Them: Yes. So, on the topic. Me: We also didn't do a round of introductions. But that's fine either way. 
Yeah, but tell me what questions you have for us. Them: Yes. So I've been reading and I've been following Brainforge's progress as a data consulting company, and you've been helping different clients and all that. So how have you seen them embrace the usage of AI? And then how do you also use that internally as well, to maybe speed up work and things like that? Me: Yeah. Is anyone else — does anyone want to take that? Them: I mean, I can take it. Me: Yeah. Them: So internally, to answer that, we highly encourage the use of AI. Not just highly — we try to make everyone on the team, both engineering and, like, sales and every — like, literally everyone on the team use AI. We set them up with Cursor, we have ChatGPT accounts, and the idea is we want people to focus on the core aspect of their jobs rather than the nitty-gritty details. So if you can get AI to get you to do that faster, that is great. And also because of that, we then try and use that to empower our clients as well. So in terms of being able to use AI-powered tools, set them up with — like, for instance, we like to use a tool called Omni for, like, their BI, because Omni integrates AI very well. And so now the people, the stakeholders, can go in there, ask the questions they need to ask without always coming back to us for, like, small things, like, how many orders do we have in 2025? Like, that's not something we want to spend our hours doing for them when you can just use AI to find that for you. So, yeah, to answer your question, yes, we use and we leverage AI highly. That's impressive. That's impressive. And it's good to know that the clients are also open to it. And yeah, in fact, this whole landscape is mind-boggling. You can be very efficient. However, you should also know what you are doing and make sure that — Me: Exactly. That second piece is what matters more than anything. Because we all did it manually for so long. Them: Yeah. Me: Yeah. Them: Yeah. That's great. I think you've answered my question. Me: Yeah, I think maybe my question is, like, tell me about what you're looking forward to next in terms of your next job or next role in data. Maybe something that you didn't previously get a chance to do, or, like, a way that you're thinking about, I would love to grow in this new domain or new part of the stack. Them: Yeah. So what I would like to do more, in fact, resonated with what Nami said, is to leverage these tools that will make you more efficient. In the past, I was using — a lot of engineers were using Stack Overflow, and currently I don't remember the last time I went to Stack Overflow for those. Me: Me neither. Yeah, maybe it's like three years, two years. Them: Yeah. Me: Yeah. Them: Yeah. So I want to leverage the use of these tools. However, I need to understand things, how they work holistically, and then how they can speed up work for not only internal teams but also for clients. Because once you are more efficient with your processes, if you are closing in on one client, maybe you can double that — you can work on three client projects and things of that sort and can still be very efficient. So that's what I'm really looking out for: using these tools, and then also, most importantly, being able to help grow the business revenue line. So that's what I'm really looking out for in my next role. Me: Cool, guys. Any other questions? Them: No, not for myself. Nothing from me as well. Me: Okay? Perfect. Thank you so much, Gilbert. I appreciate you taking all the time to work on the exercise as well. 
And kind of going into detail today, so it's really helpful. Them: Okay. Thank you. Thank you, everyone. Me: Okay? Perfect. Them: Bye. Me: Thank you so much.