Meeting Title: [Place Holder] BF: Final Interview – Architecture Discussion
Date: 2025-11-18
Meeting participants: Ashwini Sharma, Henry Zhao, Awaish Kumar, Uttam Kumaran


WEBVTT

1 00:04:24.130 00:04:25.589 Henry Zhao: Hey, Ashwini, how’s it going?

2 00:04:26.180 00:04:27.000 Ashwini Sharma: Hello.

3 00:04:29.120 00:04:31.069 Henry Zhao: Guess we’re waiting on a few other people, I believe.

4 00:04:31.070 00:04:32.960 Ashwini Sharma: Hi, Henry, yeah. How are you?

5 00:04:34.090 00:04:35.079 Henry Zhao: I’m good, thanks.

6 00:04:35.450 00:04:38.690 Ashwini Sharma: Alright, yeah, let’s wait for some more time.

7 00:04:39.320 00:04:39.920 Henry Zhao: Okay.

8 00:05:17.380 00:05:24.620 Henry Zhao: In the meantime, do you have any questions for me that I can address? Anything you want to know about Brain Forge, anything like that?

9 00:05:24.910 00:05:29.740 Ashwini Sharma: No, I think I’ve got a good amount of details from Uttam, and…

10 00:05:30.010 00:05:33.999 Ashwini Sharma: And the other two gentlemen that I talked with, Awaish and,

11 00:05:34.410 00:05:38.310 Ashwini Sharma: Delayed, I think, maybe I’m pronouncing it wrong.

12 00:05:39.610 00:05:40.760 Henry Zhao: Demilade? Yeah.

13 00:05:40.760 00:05:43.210 Ashwini Sharma: Ah, Demilade, yeah, sorry, yeah, Demilade.

14 00:05:44.180 00:05:48.739 Henry Zhao: So maybe while we’re waiting, since I haven’t interviewed yet, can you just tell me a little bit about yourself?

15 00:05:48.770 00:06:06.799 Ashwini Sharma: Oh, yeah, yeah, sure, sure. Yeah, so I currently work as a data architect in a mortgage servicing company, right? In a mortgage servicing company called BSI Financial Services. It’s a company, the parent company is based out of USA, and then there’s an India unit,

16 00:06:06.980 00:06:22.230 Ashwini Sharma: And I work out of India. So what I do here is, like, I take care of directly the ingestion part, the data transformation part, and the data visualization part. And, like, along with this, the horizontal layer of data governance, data quality, right?

17 00:06:22.510 00:06:26.010 Ashwini Sharma: All those things, like, so, yeah.

18 00:06:26.160 00:06:33.009 Ashwini Sharma: That’s what I’ve been doing. Previously in my role, I have worked at Fivetran, where I was creating data pipelines.

19 00:06:33.120 00:06:38.300 Ashwini Sharma: Ingesting data from multiple different sources into different warehouses.

20 00:06:39.560 00:06:40.150 Henry Zhao: Okay.

21 00:06:41.110 00:06:41.980 Ashwini Sharma: Hi, Awaish.

22 00:06:46.030 00:06:46.800 Awaish Kumar: Hello?

23 00:06:48.710 00:06:50.050 Awaish Kumar: How are you doing?

24 00:06:50.050 00:06:51.159 Ashwini Sharma: I’m good, how are you?

25 00:06:52.200 00:06:53.979 Awaish Kumar: I’m good as well.

26 00:06:55.200 00:07:01.349 Awaish Kumar: Yeah, so… In this meeting, like, we had a little discussion in the last interview.

27 00:07:01.350 00:07:03.689 Ashwini Sharma: Regarding, like.

28 00:07:03.690 00:07:10.479 Awaish Kumar: understanding your background. In this meeting, we are going to more talk about, like, architectural level,

29 00:07:10.980 00:07:16.270 Awaish Kumar: Decisions, or how would you appro… like, what kind of different pipelines you have worked on.

30 00:07:17.740 00:07:22.240 Awaish Kumar: But, like, what tools you have used, how complex those were.

31 00:07:22.360 00:07:29.249 Awaish Kumar: what were the data volumes? So we are going to be, going to go in depth on these things.

32 00:07:29.340 00:07:30.060 Ashwini Sharma: Okay.

33 00:07:31.060 00:07:35.030 Awaish Kumar: And, yeah, like, maybe Henry will also have a few questions.

34 00:07:35.450 00:07:36.489 Awaish Kumar: Oh, maybe I’ll know.

35 00:07:37.120 00:07:39.140 Awaish Kumar: analysis part,

36 00:07:39.940 00:07:46.519 Awaish Kumar: Or, yeah, maybe some of the tools you have, maybe, maybe you have worked on. So, yeah.

37 00:07:46.730 00:07:49.179 Awaish Kumar: That’s, like, the plan for today.

38 00:07:49.870 00:07:58.199 Ashwini Sharma: Sure, so, I don’t know, the email said something about Figma doing some.

39 00:07:58.830 00:08:06.779 Awaish Kumar: Yeah, that’s more, like, that’s a way to explain. If I, like, if I ask you some questions, and or if we…

40 00:08:06.880 00:08:16.300 Awaish Kumar: If you want to explain something, and it requires some visualization, you can use Figma or any diagramming tool if you need.

41 00:08:16.550 00:08:18.760 Ashwini Sharma: Got it, okay, okay, got it, yeah, sure.

42 00:08:19.000 00:08:19.880 Ashwini Sharma: Alright.

43 00:08:20.680 00:08:25.399 Awaish Kumar: Okay, I hope you already met each other and had an intro with each other.

44 00:08:25.400 00:08:26.590 Ashwini Sharma: We did, we did, yeah.

45 00:08:26.590 00:08:27.999 Henry Zhao: Yeah, we have. Thank you.

46 00:08:28.880 00:08:29.450 Ashwini Sharma: Thanks.

47 00:08:30.540 00:08:43.069 Awaish Kumar: Yeah, so let’s dive in then. If you can just start talking about a really complex data pipeline you have worked on in any of your

48 00:08:43.260 00:08:46.119 Awaish Kumar: Last 2-3 jobs.

49 00:08:48.700 00:08:55.769 Ashwini Sharma: Yeah, I can talk about some of the pipelines, right? Like, let me talk about one of the, you know.

50 00:08:56.370 00:09:03.100 Ashwini Sharma: one of the difficult pipelines that I had to work with. This was at Fivetran, and

51 00:09:03.270 00:09:15.929 Ashwini Sharma: If you’ve used Fivetran, there is something called the NetSuite connector, NetSuite SuiteAnalytics, right? This is one of the connectors which I built personally, right? And

52 00:09:16.060 00:09:26.719 Ashwini Sharma: The problem with this connector is that, you know, it works fine, it works seamlessly, right, when there is lesser volume of data. But as soon as the volume of data increases.

53 00:09:26.830 00:09:32.030 Ashwini Sharma: We started getting complaints from the customer saying that there is data integrity issues, right?

54 00:09:32.140 00:09:35.520 Ashwini Sharma: And, what I mean? Hey, Uttam.

55 00:09:36.330 00:09:37.189 Uttam Kumaran: How are ya?

56 00:09:37.190 00:09:38.709 Ashwini Sharma: I’m good, how are you?

57 00:09:38.790 00:09:54.260 Ashwini Sharma: So we’re just talking about the NetSuite connector, and Awaish is asking me some questions on some of the complicated… so, yeah. And the issue with this connector was that, you know, as I said, there were data integrity issues reported by some of the customers, right? And…

58 00:09:54.300 00:10:05.800 Ashwini Sharma: The way this connector works is based on the timestamp-based cursor, right? So, basically, when we do an extraction, we save the state at what point we left.

59 00:10:06.090 00:10:18.279 Ashwini Sharma: Extracting from a certain table, and then when we do a second round of extraction after maybe 15 minutes, or half an hour, or whatever is the frequency set, right, we pick up from that point onwards.

60 00:10:19.270 00:10:35.079 Ashwini Sharma: And, yeah, so the data integrity issues could not be explained by any means, right? Because it was unpredictable, it could not be replicated, right? And it was not happening for everybody, right? Only in certain cases it was happening.

61 00:10:35.280 00:10:41.220 Ashwini Sharma: And we never knew what cases it would happen. It would just surface out of nowhere. So what we started doing was, like.

62 00:10:41.530 00:10:51.249 Ashwini Sharma: we started adding a buffer to the extraction timestamp, right? So if we… if we did extraction, and it stopped at, let’s say, 1 PM,

63 00:10:51.310 00:11:11.040 Ashwini Sharma: Next time, extraction would start at 12:50, and then we gave it 10 minutes of buffer, thinking that, okay, maybe there will be some kind of late-arriving records in the… in a replica instance of NetSuite from where we are pulling out the data, right? And then that should resolve these issues of missing records somewhere.
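
Note: the timestamp cursor with a look-back buffer described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Fivetran's actual code: the function names, the `last_modified` field, and the in-memory row list are all hypothetical, and only the 10-minute buffer comes from the discussion.

```python
from datetime import datetime, timedelta

# Assumed buffer size, matching the 10 minutes mentioned in the discussion.
LOOKBACK = timedelta(minutes=10)


def next_window_start(saved_cursor: datetime, lookback: timedelta = LOOKBACK) -> datetime:
    """Start the next extraction slightly before the saved cursor."""
    return saved_cursor - lookback


def extract_since(rows, saved_cursor: datetime):
    """Return rows modified at or after the buffered start, plus the new cursor.

    `rows` is a hypothetical list of dicts with a `last_modified` datetime.
    The cursor never moves backwards, even if only buffered rows are found.
    """
    start = next_window_start(saved_cursor)
    picked = [r for r in rows if r["last_modified"] >= start]
    new_cursor = max([saved_cursor, *(r["last_modified"] for r in picked)])
    return picked, new_cursor
```

Rows inside the buffer window are read again on every run, so an approach like this relies on the load step being idempotent (for example, an upsert keyed on the primary key).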

64 00:11:11.340 00:11:14.079 Ashwini Sharma: It worked fine for… sorry, yeah, go ahead.

65 00:11:14.290 00:11:18.239 Awaish Kumar: Yeah, sorry, I just want more, like, I want to understand, like, what different

66 00:11:18.810 00:11:23.850 Awaish Kumar: What the architecture was like for these kind of connectors, and…

67 00:11:24.000 00:11:30.950 Awaish Kumar: What kind of tools were being used, how those were working together with each other.

68 00:11:31.250 00:11:39.520 Awaish Kumar: What were the different components of the systems, and how, basically, an end-to-end pipeline worked.

69 00:11:39.930 00:11:43.860 Ashwini Sharma: Right, so this is, okay, at that level, right? Okay.

70 00:11:43.860 00:11:56.170 Uttam Kumaran: Yeah, maybe, also, Ashwini, while you’re talking, sorry, I was just getting off another client call, but we actually set up a little bit of a FigJam board. Maybe if you want to do it visually, that could be also very helpful.

71 00:11:56.170 00:11:59.099 Ashwini Sharma: Sure, sure, yeah. Can you send me a link to that?

72 00:12:05.620 00:12:08.140 Uttam Kumaran: And we’ll… I just put in some,

73 00:12:09.230 00:12:15.190 Uttam Kumaran: some diagrams here, but yeah, I guess to kind of set the stage, we just wanted to walk through a little bit of, like.

74 00:12:15.400 00:12:23.029 Uttam Kumaran: sort of seeing the end-to-end pipeline and kind of seeing how you… how you think about it with… I don’t know if you’ve used FigJam before, but it’s…

75 00:12:23.280 00:12:29.770 Uttam Kumaran: you just have, like, all the common sort of diagrams. I can bring in some more,

76 00:12:30.720 00:12:34.539 Uttam Kumaran: Like, technical diagramming, little widgets, if you want.

77 00:12:35.200 00:12:37.780 Awaish Kumar: You can open the link and share your screen.

78 00:12:37.780 00:12:40.959 Ashwini Sharma: Yeah, yeah, once in, I’m just getting familiar with the tools here.

79 00:12:40.960 00:12:41.410 Uttam Kumaran: abruptly.

80 00:12:41.410 00:12:45.360 Ashwini Sharma: in a… Where is the share button?

81 00:12:46.440 00:12:47.940 Ashwini Sharma: Oh, right over here.

82 00:12:48.900 00:12:53.800 Ashwini Sharma: Entire screen content as well. What is this, Google Chrome?

83 00:12:54.600 00:12:55.500 Ashwini Sharma: This one?

84 00:12:57.260 00:13:00.170 Ashwini Sharma: Oh, system settings…

85 00:13:20.440 00:13:22.150 Ashwini Sharma: Are you able to.

86 00:13:22.610 00:13:24.469 Awaish Kumar: Yeah, yeah, we can see your screen.

87 00:13:24.470 00:13:25.450 Uttam Kumaran: Yes.

88 00:13:25.450 00:13:26.890 Ashwini Sharma: Okay, alright.

89 00:13:26.890 00:13:30.689 Uttam Kumaran: So, there’s some… there’s some diagrams here, then I also put some here on the bottom.

90 00:13:33.070 00:13:37.790 Uttam Kumaran: So yeah, I guess if you want to just use, if you scroll down a little bit, you’ll see some more…

91 00:13:38.210 00:13:41.229 Uttam Kumaran: yeah, so you can use any…

92 00:13:41.230 00:13:42.869 Ashwini Sharma: Oh, okay, okay.

93 00:13:46.900 00:13:53.240 Ashwini Sharma: Sure. So, this is how connectors work at Fivetran, right? Yeah, I can just explain you this.

94 00:13:59.280 00:14:03.819 Ashwini Sharma: So basically, Fivetran maintains an internal DB, right? This is, like…

95 00:14:08.020 00:14:18.670 Ashwini Sharma: this is where we store all the metadata about, what we are extracting, right? Different connectors, right, everything stores in one, DB.

96 00:14:21.090 00:14:29.290 Ashwini Sharma: And, do we, go and create some services.

97 00:14:30.120 00:14:34.719 Ashwini Sharma: So, Fivetran is a monolithic application, right? So,

98 00:14:35.460 00:14:42.089 Ashwini Sharma: basically, it’s just one image that we create, and that image runs in a GKE cluster, right?

99 00:14:42.280 00:14:46.099 Ashwini Sharma: So… Let me represent it,

100 00:14:49.800 00:14:54.010 Ashwini Sharma: But that image has multiple services within it, right? And…

101 00:14:54.510 00:14:56.929 Ashwini Sharma: One of the services is basically

102 00:15:05.540 00:15:06.200 Ashwini Sharma: Right.

103 00:15:06.410 00:15:11.329 Ashwini Sharma: And this extraction service is sort of, you know, has access to

104 00:15:11.450 00:15:16.050 Ashwini Sharma: this DB through mechanisms like, state management.

105 00:15:25.140 00:15:27.610 Awaish Kumar: Do you use any kind of tools like Airflow?

106 00:15:28.190 00:15:29.980 Awaish Kumar: Or anything, orchestration?

107 00:15:31.180 00:15:50.909 Ashwini Sharma: No. So, the orchestration was inbuilt over here, right? And, what it does is, like, in the internal DB, we used to capture the details about each connector, right? So, in that details, we also capture what is the frequency at which this connector is supposed to execute. So, every time.

108 00:15:51.090 00:16:05.140 Ashwini Sharma: it kind of scans all the connectors that’s available in the DB, and then figure out what are the connectors to be scheduled right now, right? So if any connector is executing right now, it does not get scheduled. If,

109 00:16:05.700 00:16:24.880 Ashwini Sharma: if something is scheduled, but it’s not yet… it’s time to run, that does not get scheduled. But other than that, it pulls out all of these, and then it will run each connector in a Kubernetes pod, right? So what this extraction service does is it will connect to the

110 00:16:24.940 00:16:32.079 Ashwini Sharma: different sources, right? So… Let me add a source one. Let’s say this is NetSuite.

111 00:16:33.100 00:16:39.469 Ashwini Sharma: And then… This might be, let’s say, Salesforce.

112 00:16:47.150 00:16:49.509 Ashwini Sharma: And how does this do? Is…

113 00:16:50.710 00:16:53.750 Ashwini Sharma: You know, there is a state management service.

114 00:16:55.650 00:16:59.290 Ashwini Sharma: Which manages the state about each connector on this DB.

115 00:17:00.390 00:17:04.380 Ashwini Sharma: The other is, the credential service.

116 00:17:08.319 00:17:13.330 Ashwini Sharma: Which, you know, manages the credentials that we have stored for each of the connectors.

117 00:17:13.589 00:17:32.790 Ashwini Sharma: So, when you’re creating a connector in Fivetran, it may be possible that you provide a user ID and password for a DB-like connector, but for a SaaS application, it’s just OAuth, right? And in case of OAuth, you just save those tokens, a refresh token and access tokens, right? So, all of these… ignore my spelling mistakes, please.
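
Note: the two credential shapes mentioned above (user ID and password for database sources, OAuth tokens for SaaS sources) could be modelled along these lines. This is an illustrative sketch only; the class names and the `connection_params` helper are hypothetical and are not taken from Fivetran.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class PasswordCredential:
    """Credential for a database-like source."""
    user: str
    password: str


@dataclass(frozen=True)
class OAuthCredential:
    """Credential for a SaaS source that uses OAuth."""
    access_token: str
    refresh_token: str


Credential = Union[PasswordCredential, OAuthCredential]


def connection_params(cred: Credential) -> dict:
    """Translate a stored credential into what the extractor would send."""
    if isinstance(cred, OAuthCredential):
        # SaaS APIs typically expect a bearer token in the request header.
        return {"Authorization": f"Bearer {cred.access_token}"}
    # Database drivers typically take the user and password directly.
    return {"user": cred.user, "password": cred.password}
```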

118 00:17:33.110 00:17:37.439 Awaish Kumar: Yeah, but what that service looks like, like, is it a… what is that? Is it a…

119 00:17:37.440 00:17:45.969 Ashwini Sharma: These are all Java-based services, right? It’s just, you know, Java objects that are interacting with the DB using JDBC connections.

120 00:17:47.390 00:17:48.130 Awaish Kumar: Okay.

121 00:17:48.130 00:17:48.740 Ashwini Sharma: Okay.

122 00:17:49.950 00:17:53.179 Ashwini Sharma: So, extraction service is, is sort of,

123 00:17:53.430 00:17:57.170 Ashwini Sharma: I’d put it as, like, horizontal, right, like this.

124 00:18:01.270 00:18:09.380 Ashwini Sharma: And so, the extraction service will get details about connections that are, you know, credentials that are required to access these sources.

125 00:18:11.460 00:18:12.779 Ashwini Sharma: Oh, maybe…

126 00:18:16.730 00:18:20.679 Uttam Kumaran: And the extraction service has the info on, like, schedules and stuff like that.

127 00:18:20.680 00:18:27.200 Ashwini Sharma: No, it doesn’t have. So, there is another application called, this was called… sorry.

128 00:18:29.190 00:18:30.229 Ashwini Sharma: Oh, come on.

129 00:18:36.220 00:18:36.940 Ashwini Sharma: Yeah.

130 00:18:37.730 00:18:40.100 Ashwini Sharma: This is called the scheduler application.

131 00:18:40.580 00:18:41.280 Uttam Kumaran: Okay.

132 00:18:42.270 00:18:44.019 Ashwini Sharma: Also called Babysitter.

133 00:18:44.530 00:18:45.250 Ashwini Sharma: Right?

134 00:18:45.540 00:18:55.519 Ashwini Sharma: And this application had access to all the schedules, right? And this used to trigger, you know, a job, basically, in the GKE cluster.

135 00:18:55.850 00:18:58.650 Ashwini Sharma: Which executes the image of these services.

136 00:18:59.420 00:19:01.000 Ashwini Sharma: So now, yeah.

137 00:19:01.330 00:19:02.790 Ashwini Sharma: Any questions?

138 00:19:05.900 00:19:11.249 Awaish Kumar: I was just saying that if you can connect the arrow, so we can know the flow.

139 00:19:17.350 00:19:22.220 Ashwini Sharma: This should be two-way connectors, I’m not sure if I can make it two-way.

140 00:19:23.350 00:19:27.860 Uttam Kumaran: Yeah, if you right-click the arrow, itself.

141 00:19:28.210 00:19:31.640 Uttam Kumaran: And then, you can click.

142 00:19:33.100 00:19:36.470 Uttam Kumaran: this one.

143 00:19:39.020 00:19:39.810 Ashwini Sharma: This one?

144 00:19:40.420 00:19:41.090 Ashwini Sharma: No.

145 00:19:41.380 00:19:43.099 Uttam Kumaran: There it is.

146 00:19:47.260 00:19:48.920 Uttam Kumaran: Yeah, you can book this.

147 00:19:49.770 00:19:51.600 Uttam Kumaran: And… This one here.

148 00:19:51.830 00:19:53.739 Ashwini Sharma: I’ll just draw… okay.

149 00:19:54.270 00:19:55.410 Ashwini Sharma: Sorry, which one?

150 00:19:55.880 00:19:58.770 Uttam Kumaran: If you click on that, and then click on the second to the left.

151 00:19:59.070 00:20:00.440 Uttam Kumaran: Second from the left, yeah.

152 00:20:00.440 00:20:01.230 Ashwini Sharma: This one?

153 00:20:01.670 00:20:03.580 Ashwini Sharma: Oh, okay, okay, yeah.

154 00:20:05.270 00:20:08.489 Ashwini Sharma: So basically, you know, the request goes from,

155 00:20:13.710 00:20:14.490 Ashwini Sharma: Yeah.

156 00:20:20.040 00:20:25.170 Ashwini Sharma: Yeah, the extraction service reaches out to this source, whether it’s a

157 00:20:25.310 00:20:36.240 Ashwini Sharma: you know, HTTP request, or whether it could be a JDBC kind of request, depending on different, different sources, right? And then the data starts flowing in, right? And what this does is,

158 00:20:38.230 00:20:44.319 Ashwini Sharma: It will… You know, it is going to pass it over to the updater service.

159 00:20:58.010 00:21:10.129 Ashwini Sharma: So the updater service is going to accumulate this thing over a, you know, in the memory itself, and once it’s… the data reaches a certain volume, it’s going to write it into S3, so…

160 00:21:11.230 00:21:13.660 Ashwini Sharma: Do we have a S3 image somewhere?

161 00:21:15.420 00:21:16.430 Awaish Kumar: Well, you can just write.

162 00:21:16.430 00:21:18.650 Ashwini Sharma: Oh, I’ll just write it, yeah. Okay.

163 00:21:31.810 00:21:39.459 Ashwini Sharma: Initially, it was S3, later we changed it to GCS once, once we moved from AWS Infra to GCP Infra.

164 00:21:41.700 00:21:42.620 Ashwini Sharma: -Oh.

165 00:21:47.750 00:21:55.480 Ashwini Sharma: Right? And then comes, the data writer service.

166 00:22:03.180 00:22:05.240 Ashwini Sharma: This service…

167 00:22:05.840 00:22:16.630 Ashwini Sharma: the updater service will kick off this one, right? Once… once it figures out that this, the files that it has been writing in GCS has reached a certain point, right?

168 00:22:16.740 00:22:18.579 Uttam Kumaran: It will trigger this one.

169 00:22:18.580 00:22:23.640 Ashwini Sharma: And then this, is going to write into… warehouse.
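
Note: the hand-off described above (buffer records in memory, write a file to object storage once the buffer reaches a certain size, then trigger the writer once enough has been staged) might look roughly like this. The `Updater` class, both thresholds, and the two callbacks are hypothetical stand-ins; nothing here is taken from the actual service.

```python
from typing import Callable, List


class Updater:
    """Buffers records, flushes them as files, and signals the warehouse writer."""

    def __init__(
        self,
        flush_bytes: int,
        handoff_bytes: int,
        write_file: Callable[[bytes], None],
        trigger_writer: Callable[[], None],
    ) -> None:
        self.flush_bytes = flush_bytes        # buffer size that forces a file write
        self.handoff_bytes = handoff_bytes    # staged volume that triggers the writer
        self.write_file = write_file          # stand-in for an S3/GCS upload
        self.trigger_writer = trigger_writer  # stand-in for starting the data writer
        self._buffer: List[bytes] = []
        self._buffered = 0
        self._staged = 0

    def add(self, record: bytes) -> None:
        self._buffer.append(record)
        self._buffered += len(record)
        if self._buffered >= self.flush_bytes:
            self.flush()

    def flush(self) -> None:
        if not self._buffer:
            return
        payload = b"".join(self._buffer)
        self.write_file(payload)
        self._staged += len(payload)
        self._buffer.clear()
        self._buffered = 0
        if self._staged >= self.handoff_bytes:
            self.trigger_writer()
            self._staged = 0
```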

170 00:22:32.780 00:22:39.989 Awaish Kumar: Yeah, so it is the same flow for warehouse, or even if it is, like, reverse ETL, or writing to an

171 00:22:40.520 00:22:41.990 Awaish Kumar: SaaS platform.

172 00:22:42.570 00:22:50.390 Ashwini Sharma: So Fivetran does not have a reverse ETL, kind of… it’s not a… it’s a one-sided flow of data, right?

173 00:22:51.130 00:22:55.420 Ashwini Sharma: I’m not sure if they’ve added that feature since 2000…

174 00:22:56.340 00:23:04.719 Ashwini Sharma: 22 when I left there, 21 or 23 years ago. Yeah, because they bought… they got Census, so I think they started to add it. Okay.

175 00:23:05.480 00:23:06.000 Ashwini Sharma: Yep.

176 00:23:06.000 00:23:08.520 Uttam Kumaran: But, yeah, back then, there wasn’t any, yeah.

177 00:23:08.520 00:23:09.719 Ashwini Sharma: Right, yeah.

178 00:23:10.190 00:23:14.110 Ashwini Sharma: So, yeah, this is what… what used to happen, right?

179 00:23:14.430 00:23:21.609 Ashwini Sharma: And… Okay, so the scheduler is the one which triggers these flows, like, based on a schedule.

180 00:23:21.980 00:23:23.579 Ashwini Sharma: Yeah, based on schedule, yes.

181 00:23:24.110 00:23:29.749 Ashwini Sharma: And it was an inbuilt, it’s not Airflow or Dagster or anything like that, it’s just inbuilt,

182 00:23:29.890 00:23:32.290 Ashwini Sharma: Scheduler that… that does the job.

183 00:23:33.350 00:23:36.490 Awaish Kumar: So, if you can talk a little bit about the scale of the…

184 00:23:36.730 00:23:43.930 Awaish Kumar: data, scale of the different connectors, like, the, like, Fivetran must have many clients.

185 00:23:43.970 00:23:44.780 Ashwini Sharma: Yeah.

186 00:23:44.780 00:23:49.969 Awaish Kumar: How do you… how the infrastructure was able to handle that workload?

187 00:23:50.570 00:23:53.829 Awaish Kumar: And maybe for non-functional requirements, like.

188 00:23:54.100 00:24:01.309 Awaish Kumar: How these services were made sure to run smoothly, like, fault-tolerant and the reliable.

189 00:24:02.690 00:24:11.020 Ashwini Sharma: Yeah, so basically, like, you know, all these services, as I said, it’s running on a… GKE cluster.

190 00:24:11.200 00:24:21.200 Ashwini Sharma: So, that has auto-scaling enabled, so anytime there is, multiple connectors to be scheduled, it just scales up, right? We never had any issues with,

191 00:24:21.350 00:24:27.529 Ashwini Sharma: with this. Earlier, when we were on AWS, that time it was just, running processes on

192 00:24:27.730 00:24:42.669 Ashwini Sharma: on EC2 instances, right? And in those cases, there were issues where we would, this was in the early days, right? Where we would find that certain connectors ran out of memory, right? It could not scale up, so in those cases, what we used to do was

193 00:24:42.720 00:24:54.730 Ashwini Sharma: We used to have a configuration in that internal DB, where against each connector, we would, if it had to be overridden, right, and scaled on a larger EC2 instance, we used to turn that flag on.

194 00:24:54.730 00:25:10.699 Ashwini Sharma: And, that would schedule it on a larger EC2 instance. So, it added more memory, more CPU, and it would scale… that was how we were scaling it. But using this Kubernetes, we never had to, you know, deal with any of these things.
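
Note: the per-connector override flag described above for the EC2 days could be as simple as the following. The flag name and instance labels are made up for illustration; the transcript only says that a flag in the internal DB moved a connector onto a larger instance.

```python
DEFAULT_INSTANCE = "standard"  # hypothetical label for the normal instance size
LARGE_INSTANCE = "large"       # hypothetical label for the bigger instance size


def instance_for(connector_config: dict) -> str:
    """Pick the instance size from a per-connector override flag."""
    if connector_config.get("use_large_instance", False):
        return LARGE_INSTANCE
    return DEFAULT_INSTANCE
```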

195 00:25:12.080 00:25:18.190 Awaish Kumar: No, like, but for example, extraction Service is the one which is like,

196 00:25:18.610 00:25:21.720 Awaish Kumar: Basically, extracting the… extracting the data.

197 00:25:23.190 00:25:29.999 Awaish Kumar: So you are saying that service itself is on a Kubernetes cluster, and it is spread out.

198 00:25:30.750 00:25:37.120 Ashwini Sharma: Yeah, the entire thing works as a single image, right? It’s not like a… it’s not a, you know…

199 00:25:37.480 00:25:40.039 Awaish Kumar: Okay. What do we call it? Like.

200 00:25:40.870 00:25:47.249 Ashwini Sharma: service-oriented architecture, where one service is calling another service. All of them are tightly packed within a single image.

201 00:25:47.880 00:25:58.859 Ashwini Sharma: And they work together, right? So, like, when I’m running a job, right, all that thing runs as a single process within the JVM.

202 00:25:59.790 00:26:00.490 Awaish Kumar: Okay.

203 00:26:00.870 00:26:01.820 Awaish Kumar: Understood.

204 00:26:02.880 00:26:03.600 Ashwini Sharma: Yep.

205 00:26:05.200 00:26:15.959 Ashwini Sharma: And, the… yeah, so basically, right, once… once the file size reaches around 1.5GB, that’s when the data writer service kicks in, right?

206 00:26:15.960 00:26:26.479 Ashwini Sharma: And then it kind of merges the data into the warehouse. When it is writing into the warehouse, it generally writes into a staging table, and then does an upsert with the main table.
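
Note: the "write to a staging table, then upsert into the main table" step can be illustrated with SQLite from the Python standard library. The `orders` table and its columns are invented for the example, and a real warehouse would more likely use its own MERGE statement; this only shows the shape of the pattern.

```python
import sqlite3


def merge_batch(conn: sqlite3.Connection, rows) -> None:
    """Load a batch into a staging table, then upsert it into the main table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (id INTEGER PRIMARY KEY, status TEXT)"
    )
    conn.execute(
        "CREATE TEMP TABLE IF NOT EXISTS orders_staging (id INTEGER, status TEXT)"
    )
    conn.execute("DELETE FROM orders_staging")  # start each batch from empty
    conn.executemany(
        "INSERT INTO orders_staging (id, status) VALUES (?, ?)", rows
    )
    # "WHERE 1" avoids a parsing ambiguity SQLite has with INSERT ... SELECT ... ON CONFLICT.
    conn.execute(
        "INSERT INTO orders (id, status) "
        "SELECT id, status FROM orders_staging WHERE 1 "
        "ON CONFLICT(id) DO UPDATE SET status = excluded.status"
    )
    conn.commit()
```

Because existing keys are updated rather than duplicated, re-delivering the same rows leaves the main table unchanged.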

207 00:26:30.770 00:26:31.620 Ashwini Sharma: Right?

208 00:26:31.760 00:26:35.059 Ashwini Sharma: What else can I talk about over here?

209 00:26:36.820 00:26:42.420 Awaish Kumar: I see that it is a zoomed-in view of an… single job.

210 00:26:43.430 00:26:46.359 Awaish Kumar: Let’s a bit zoom out, and…

211 00:26:46.760 00:26:50.380 Awaish Kumar: Like, how… how the infrastructure looks like?

212 00:26:50.760 00:26:52.220 Awaish Kumar: on a…

213 00:26:52.750 00:26:58.660 Awaish Kumar: like, the, the, like, we have a… we have a Kubernetes cluster where all these are in a single…

214 00:26:58.910 00:27:05.160 Awaish Kumar: image, all… this full flow is running inside an image, and that is running for…

215 00:27:05.160 00:27:09.660 Ashwini Sharma: Except for this one, right? This one runs outside. It doesn’t run in the image, yeah.

216 00:27:10.330 00:27:15.799 Awaish Kumar: So, like, is this single point of failure? Like, what if this service fails?

217 00:27:16.980 00:27:20.869 Ashwini Sharma: If this service fails, it fails, right? So, another node will pick it up.

218 00:27:21.310 00:27:30.479 Ashwini Sharma: it gets rescheduled after, you know, if it fails, if the connector fails, let’s say, right? If the cluster itself goes down, right, if the cluster…

219 00:27:30.580 00:27:46.439 Ashwini Sharma: Yeah, if the cluster itself goes down, then that’s a different case, right? Where we’ll have to, you know, manually come and restart the cluster, or, you know, take action. But if a pipeline fails, like, it’s one job that is failing, right? When one job fails.

220 00:27:46.540 00:27:52.859 Ashwini Sharma: We’ll mark it as failed, and then the next run, it’ll start again. Now, it could fail due to multiple reasons, right? It could fail because

221 00:27:53.120 00:28:10.779 Ashwini Sharma: the pod crashed, or the node crashed, right? Or it could fail because the pipeline itself stopped working, because there was an issue with the code, or whatever it is, right? Generally, what happens is it will try for SaaS-based connectors, right?

222 00:28:11.600 00:28:19.739 Ashwini Sharma: it will try sending requests, multiple times, right? And if it fails to receive any data, in those cases, it’ll throw an exception.
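
Note: the retry behaviour described above (send the request several times, and raise an exception if no data comes back) could be sketched like this. The function, the exception class, and the default attempt count are assumptions for illustration only.

```python
import time


class ExtractionFailed(Exception):
    """Raised when a source returns no data after all retry attempts."""


def fetch_with_retries(fetch, attempts: int = 3, delay_seconds: float = 0.0):
    """Call `fetch` up to `attempts` times and return the first non-empty result."""
    last_error = None
    for attempt in range(attempts):
        try:
            data = fetch()
            if data:
                return data
        except Exception as exc:  # network errors, timeouts, and so on
            last_error = exc
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    # All attempts exhausted: surface this as a genuine job failure.
    raise ExtractionFailed(f"no data after {attempts} attempts") from last_error
```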

223 00:28:20.200 00:28:24.779 Ashwini Sharma: So, it’s a valid failure, right? And.

224 00:28:24.780 00:28:32.159 Awaish Kumar: Yeah, that’s… yeah, my question was more like, as we said, schedule service is the one which is scheduling all the jobs, right? Yeah.

225 00:28:32.510 00:28:37.540 Awaish Kumar: And that is a single service, which is responsible for running all the tasks.

226 00:28:38.000 00:28:42.010 Awaish Kumar: So, if that service… goes down.

227 00:28:42.010 00:28:42.650 Ashwini Sharma: Yeah.

228 00:28:42.960 00:28:48.840 Awaish Kumar: the system… Then we’ll stop executing the task, until that’s… that’s…

229 00:28:49.100 00:29:00.069 Ashwini Sharma: So that… even that scheduler service is deployed on a Kubernetes cluster, right? If that goes down, again, another will spin up, and then it will ensure that it’s always… at least one node is always running.

230 00:29:01.130 00:29:01.840 Awaish Kumar: Okay.

231 00:29:04.410 00:29:08.070 Ashwini Sharma: That was the reason, like, they wanted to move out of,

232 00:29:08.270 00:29:14.779 Ashwini Sharma: this EC2 instance. At that point, I’m not sure. When we moved out from AWS to

233 00:29:15.160 00:29:19.450 Ashwini Sharma: GCP. I don’t think that EKS was available,

234 00:29:20.560 00:29:24.810 Ashwini Sharma: On AWS, but I’m not sure. It may have been, but it was not that popular.

235 00:29:25.910 00:29:30.740 Awaish Kumar: Okay, and, like, now, if you can, like, briefly, give…

236 00:29:31.110 00:29:39.629 Awaish Kumar: an overview of, like, the choices of the tools you made, like, for example, database. What kind of database were being used, and why, and…

237 00:29:40.110 00:29:43.189 Awaish Kumar: Same for the different…

238 00:29:44.470 00:29:50.050 Ashwini Sharma: So, at Fivetran, I didn’t have to make any choices, right? Because these were already, you know, already there.

239 00:29:50.050 00:29:50.520 Uttam Kumaran: Yeah.

240 00:29:50.520 00:29:52.909 Ashwini Sharma: And I just have to use it right here.

241 00:29:53.450 00:30:02.340 Uttam Kumaran: Yeah, so I think I want to go probably, like, a little bit of a different direction, like, maybe we can start just from a example scenario for, like, an example client of ours.

242 00:30:02.340 00:30:05.620 Ashwini Sharma: Right. You know, we have several clients that, when they start with us.

243 00:30:05.620 00:30:10.390 Uttam Kumaran: They’re getting a directive, to establish, like, a data stack.

244 00:30:10.460 00:30:12.979 Ashwini Sharma: And maybe you can walk us through…

245 00:30:13.070 00:30:24.080 Uttam Kumaran: thinking through that problem and, like, what the architecture could be. And let me give you a little bit of situation on, like, what the scenario is. So assume you have a variety of different sources.

246 00:30:24.080 00:30:25.880 Ashwini Sharma: database sources.

247 00:30:25.900 00:30:44.730 Uttam Kumaran: let’s just say, like, Postgres, you have, also, like, marketing and e-commerce sources, right? So maybe, like, think about it, and you have, like, maybe finance and sales, CRM-related sources. The goal, you know, for the architecture is to, you know, set

248 00:30:44.820 00:30:46.810 Uttam Kumaran: Get access to all of those.

249 00:30:46.950 00:30:52.630 Uttam Kumaran: land it into a data warehouse, and then make it available through data marts for reporting.

250 00:30:52.690 00:31:11.439 Uttam Kumaran: So maybe we can sort of walk through that flow, and in particular, like, I can… I can… I can ask some questions, sort of about how you’re thinking about setting up each part of the infrastructure, and then totally, if you’re… if you’re able to also give us some insight into, like, why you’re choosing certain tools for each part, that would be… that would be great.

251 00:31:11.760 00:31:12.620 Ashwini Sharma: Yeah, sure.

252 00:31:14.170 00:31:19.500 Ashwini Sharma: Okay, so, multiple sources, and, like…

253 00:31:19.730 00:31:23.640 Ashwini Sharma: Am I allowed to use a… tool like Fivetran, or…

254 00:31:23.640 00:31:24.959 Uttam Kumaran: Yeah, yeah, yeah, totally.

255 00:31:25.250 00:31:28.000 Ashwini Sharma: Oh, okay, okay, in that case, like, yeah.

256 00:31:33.700 00:31:40.099 Uttam Kumaran: Yeah, start with whatever tools, and recommendations, and then we’ll kind of give you some edge cases that we can talk through.

257 00:31:40.850 00:31:41.640 Ashwini Sharma: Shop.

258 00:31:56.410 00:31:58.369 Ashwini Sharma: Okay, I’m just going to take one or…

259 00:32:02.200 00:32:03.850 Ashwini Sharma: You said,

260 00:32:04.810 00:32:12.659 Ashwini Sharma: Okay, so we have a… yeah, let me just add variety to it, right? S3, which is CSV-based sources.

261 00:32:15.950 00:32:17.499 Ashwini Sharma: What else can I add?

262 00:32:49.660 00:32:56.530 Ashwini Sharma: Alright, so within this tool, we have all the scheduling configuration that, does the extraction for us.

263 00:32:56.780 00:33:03.730 Ashwini Sharma: Right? Scheduling configuration, connection configuration, all the configuration that is required for us to extract the data out of these sources.

264 00:33:04.030 00:33:06.780 Ashwini Sharma: And push it into, you know, some…

265 00:33:08.880 00:33:17.210 Ashwini Sharma: some basic location, right? So, one of the options that… that could be… is, we specify it does…

266 00:33:18.480 00:33:19.570 Ashwini Sharma: S3?

267 00:33:24.950 00:33:28.240 Ashwini Sharma: Right? Or it could be, you know,

268 00:33:32.340 00:33:36.180 Ashwini Sharma: Directly as, tables in warehouse.

269 00:33:53.490 00:33:58.389 Uttam Kumaran: Yeah, so then talk to me a little bit about your choice for S3.

270 00:33:59.640 00:33:59.960 Ashwini Sharma: Yeah.

271 00:33:59.960 00:34:01.059 Uttam Kumaran: at the warehouse.

272 00:34:01.060 00:34:03.179 Ashwini Sharma: So, so, yes,

273 00:34:05.540 00:34:22.600 Ashwini Sharma: Yeah, so S3, I choose, yeah, so basically, like, there are times where, you know, I’ve seen, you know, leadership thinking about, okay, we don’t want to, you know, get locked with a certain vendor, right? We want to remain,

274 00:34:22.750 00:34:40.330 Ashwini Sharma: with open data format as long as possible, right? So, for example, if somebody is using Snowflake, right, but they still have thoughts that, okay, maybe in future we want to get rid of Snowflake and go with a different warehouse. In that case, I would not go directly into Snowflake and make Fivetran write data into there.

275 00:34:40.330 00:34:44.509 Ashwini Sharma: Let’s go with S3, Iceberg Catalog, Open Data Format.

276 00:34:44.830 00:34:49.839 Ashwini Sharma: keep the tables over there, and then use Snowflake to query it as external tables.

277 00:34:49.969 00:34:54.150 Ashwini Sharma: But if the situation is sort of like, okay.

278 00:34:54.520 00:34:58.339 Ashwini Sharma: We have a deal with Snowflake, we want to, you know, directly use it.

279 00:34:58.460 00:35:06.110 Ashwini Sharma: then we can directly go with the tables and, you know, let Fivetran, the ETL tool, write tables into Snowflake directly.

280 00:35:06.950 00:35:07.520 Uttam Kumaran: Okay.

281 00:35:14.860 00:35:16.160 Ashwini Sharma: This is…

282 00:35:31.890 00:35:33.750 Ashwini Sharma: This is one-way arrow.

283 00:35:33.980 00:35:36.590 Ashwini Sharma: Sorry, I don’t know how to eliminate it.

284 00:36:01.470 00:36:05.850 Ashwini Sharma: This is one-way arrow, okay, matt.

285 00:36:20.800 00:36:23.240 Uttam Kumaran: Yeah, so if you click on the arrow,

286 00:36:23.440 00:36:26.669 Uttam Kumaran: And you just… if you go to the one on the right here.

287 00:36:26.670 00:36:27.310 Ashwini Sharma: Yeah.

288 00:36:27.810 00:36:30.250 Uttam Kumaran: Yep, so that’s… you could put there.

289 00:36:30.250 00:36:30.880 Ashwini Sharma: Yeah.

290 00:36:33.240 00:36:35.909 Uttam Kumaran: So you’re gonna… you’re, you’re gonna land in both places?

291 00:36:36.570 00:36:41.190 Ashwini Sharma: No, it’s going to be either-or, right? It’s not both the places, it’s either-or.

292 00:36:41.830 00:36:42.370 Uttam Kumaran: Okay.

293 00:36:43.220 00:36:48.959 Awaish Kumar: Okay, and what, like, what would be the case for, like, putting it in both, right?

294 00:36:49.720 00:36:55.459 Ashwini Sharma: There won’t be a case for putting it in both. However, one client will be using only one option, not both options.

295 00:36:56.830 00:37:01.300 Awaish Kumar: Like, yeah, like, what about, like, data lake architecture?

296 00:37:03.800 00:37:12.219 Ashwini Sharma: Well, I mean, you can always follow this, right? If you want to go for a data lake. You can always use the warehouse to read this data.

297 00:37:12.510 00:37:14.010 Ashwini Sharma: Right? This is…

298 00:37:14.010 00:37:23.539 Uttam Kumaran: what we’re trying to say is, like, that’s the architecture in where you’d have both, right? Not landing in both, but the warehouse would be reading from S3.

299 00:37:23.540 00:37:27.680 Ashwini Sharma: Yeah, right, right, right. Yeah, so this is just,

300 00:37:28.210 00:37:31.540 Ashwini Sharma: Okay, I should maybe make it in a different way.

301 00:37:32.310 00:37:33.260 Ashwini Sharma: Let me…

302 00:37:36.370 00:37:39.250 Ashwini Sharma: Let me do like this, and then… right?

303 00:37:40.020 00:37:40.990 Ashwini Sharma: This is one-off.

304 00:37:40.990 00:37:44.179 Uttam Kumaran: So, yeah, like, the S3 as iceberg is basically optional.

305 00:37:44.480 00:37:46.820 Ashwini Sharma: Yes, and…

306 00:38:18.540 00:38:19.550 Ashwini Sharma: Alright, yeah.

307 00:38:19.890 00:38:25.900 Ashwini Sharma: And in this case, there will be an arrow from Yeah.

308 00:38:26.250 00:38:27.160 Ashwini Sharma: Which is…

309 00:38:27.290 00:38:37.039 Ashwini Sharma: I’m trying to indicate that, you know, the warehouse is actually reading external files, and these files are exposed as external tables into the warehouse.

310 00:38:37.490 00:38:42.229 Ashwini Sharma: This is the case, this, this, one is the case where

311 00:38:43.510 00:38:53.290 Ashwini Sharma: you know, customer has already committed to a warehouse, you know, we’re not going to move anywhere else. This is what we’re going to use. We need our queries to run super fast.

312 00:38:53.960 00:39:08.520 Ashwini Sharma: And this is the situation, right? This is the situation where, you know, we’re going to use this warehouse, but yeah, maybe in future we just want to migrate to something else. We don’t want our data to be locked in into a… with a certain vendor, right? That’s the case.
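
A minimal sketch of the either-or landing decision described above. The `ClientProfile` fields and the returned labels are hypothetical names chosen for illustration; they are not from any tool mentioned in the discussion.

```python
from dataclasses import dataclass


@dataclass
class ClientProfile:
    committed_to_warehouse: bool  # client has settled on one warehouse vendor
    may_migrate_later: bool       # client wants to keep the option to switch


def choose_landing(profile: ClientProfile) -> str:
    """Return exactly one landing target; the design is either-or, never both."""
    if profile.may_migrate_later or not profile.committed_to_warehouse:
        # Open format on S3 (Iceberg), read by the warehouse as external tables.
        return "s3_iceberg_external_tables"
    # Native tables written directly by the ingestion tool, for faster queries.
    return "warehouse_native_tables"


print(choose_landing(ClientProfile(committed_to_warehouse=True, may_migrate_later=False)))
```

The trade-off encoded here is the one stated in the conversation: native tables favor query speed, while the open format favors avoiding vendor lock-in.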

313 00:39:09.330 00:39:10.399 Ashwini Sharma: So now…

314 00:39:10.400 00:39:11.050 Uttam Kumaran: So my…

315 00:39:11.050 00:39:11.640 Ashwini Sharma: Yeah.

316 00:39:12.200 00:39:15.580 Awaish Kumar: And how would you like to… the data warehouse.

317 00:39:15.970 00:39:16.490 Ashwini Sharma: Buddy?

318 00:39:16.780 00:39:20.680 Awaish Kumar: If you can briefly talk about how would you architect the data warehouse?

319 00:39:20.680 00:39:24.700 Ashwini Sharma: Yeah, yeah, I’ll, should I draw it in a different way, right?

320 00:39:24.920 00:39:31.320 Ashwini Sharma: So basically, the way it will work is… I’ll take Snowflake as an example, and then talk about it, right?

321 00:39:31.320 00:39:32.540 Uttam Kumaran: Snowflake’s fine, yeah.

322 00:39:32.540 00:39:33.060 Ashwini Sharma: Yeah.

323 00:39:33.750 00:39:36.590 Ashwini Sharma: So, yeah.

324 00:39:37.230 00:39:38.909 Ashwini Sharma: Let’s talk about this one, right?

325 00:39:54.210 00:39:56.549 Ashwini Sharma: The raw layer is where the data

326 00:39:57.360 00:40:08.869 Ashwini Sharma: gets landed, data lands into the raw layer, or basically the… all the external tables that we are creating, right? They are created in the raw layer, right? This is the layer

327 00:40:09.660 00:40:12.590 Ashwini Sharma: Which… you know.

328 00:40:13.270 00:40:20.329 Ashwini Sharma: let’s say all developers, or 90% of developers will not have access to, right? This is how I want to structure it.

329 00:40:20.530 00:40:21.750 Ashwini Sharma: And,

330 00:40:28.480 00:40:32.359 Ashwini Sharma: So on top of this layer, we’ll… I’ll keep a base layer.

331 00:40:35.010 00:40:37.610 Ashwini Sharma: This… this consists of views.

332 00:40:38.430 00:40:42.809 Ashwini Sharma: That read from… tables in raw layer.

333 00:40:43.880 00:40:52.009 Ashwini Sharma: Right? Why I’m doing this? This is because I don’t want developers to be accessing raw layer, right? Nobody should query the data in the raw layer.

334 00:40:52.150 00:41:09.850 Ashwini Sharma: The views that are there in the base layer basically read from the raw layer, but this gives me one layer of indirection, right? Supposing my raw layer is going to have, for compliance reasons, like, millions of records, last 10 years of records. We don’t need all of that in

335 00:41:10.030 00:41:16.409 Ashwini Sharma: To… to… to be used in modeling, right? For modeling, maybe you want last 2 years of records.

336 00:41:16.610 00:41:23.210 Ashwini Sharma: So, in a way, this view prevents developers from issuing queries, that…

337 00:41:23.480 00:41:27.469 Ashwini Sharma: you know, scans the entire table in raw layer, right?
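
The base-layer idea above can be sketched as a view that exposes only a recent window of a raw table. This uses Python's built-in `sqlite3` as a stand-in for the warehouse; the table name, view name, columns, and the fixed cutoff date are assumptions for illustration. Whether such a view actually reduces the data scanned depends on how the real warehouse prunes partitions, which SQLite does not model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- raw layer: full history kept for compliance
    CREATE TABLE raw_netsuite_transactions (id INTEGER, txn_date TEXT, amount REAL);
    INSERT INTO raw_netsuite_transactions VALUES
        (1, '2014-03-01', 10.0),
        (2, '2024-06-15', 20.0),
        (3, '2025-01-10', 30.0);
    -- base layer: a view exposing only the window that modeling needs
    CREATE VIEW base_netsuite_transactions AS
        SELECT id, txn_date, amount
        FROM raw_netsuite_transactions
        WHERE txn_date >= '2023-11-18';
""")

# Developers query the view, never the raw table.
rows = conn.execute(
    "SELECT id FROM base_netsuite_transactions ORDER BY id"
).fetchall()
print(rows)  # [(2,), (3,)]
```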

338 00:41:27.610 00:41:30.910 Ashwini Sharma: Yeah.

339 00:41:31.850 00:41:39.199 Ashwini Sharma: So this is where the dbt will start operating, right? This is where dbt starts… will read data from.

340 00:41:39.620 00:41:40.680 Ashwini Sharma: And…

341 00:41:52.590 00:41:56.350 Ashwini Sharma: Stage layer, where all… cleaned,

342 00:41:57.380 00:41:59.560 Ashwini Sharma: staged data will reside.

343 00:42:02.090 00:42:07.209 Ashwini Sharma: On top of this will be… The mart layer.

344 00:42:22.280 00:42:31.920 Ashwini Sharma: The mart layer consists of dimensions, facts, and aggregate tables, right? On top of this, I can have one more layer, which is called the semantic layer.

345 00:42:36.390 00:42:39.279 Ashwini Sharma: Which contains the semantic views.

346 00:42:44.380 00:42:45.660 Ashwini Sharma: definitions.

347 00:42:46.750 00:42:56.119 Ashwini Sharma: Yeah, this is how it will… and data governance will be, you know, across the various layers, depending on who is reading, you know.

348 00:42:57.150 00:43:03.120 Uttam Kumaran: So talk to me about the semantic layer, and, you know, I think there’s…

349 00:43:04.770 00:43:20.409 Uttam Kumaran: I guess there’s probably, like, I think two… yeah, if you… let’s say your customer was like, okay, I’m familiar with marts and staging, but, like, what’s the purpose of the semantic layer? It seems like you’re overcomplicating it. Like, what do you… what do you… how would you… how would you explain the…

350 00:43:20.550 00:43:22.610 Uttam Kumaran: The benefits or the importance of it?

351 00:43:22.800 00:43:36.380 Ashwini Sharma: Yeah, so semantic layer will standardize the definitions across this mart layer, right? So mart layer has what… it has tables like FCT something something, right? Or DIM calendar, or DIM, you know,

352 00:43:36.670 00:43:44.619 Ashwini Sharma: listings, or suppliers, or things like that, right? And that definition is not consistent across different

353 00:43:44.720 00:43:50.869 Ashwini Sharma: aspects of the business, right? Business might know an entity using a different name.

354 00:43:50.870 00:44:06.979 Ashwini Sharma: And when you define that name in that semantic layer, everybody understands that, okay, this is the entity that we are talking about, right? Not only that, it helps, the semantic layer will help, you know, when you use it together with the Cortex AI, or

355 00:44:08.320 00:44:17.539 Ashwini Sharma: similar AI tools. It will help the business users to issue queries in natural language and get answers to,

356 00:44:18.730 00:44:27.309 Ashwini Sharma: answers from the underlying data, right? So they don’t have to be techy enough to write a SQL or make joins on the… on the

357 00:44:27.660 00:44:28.729 Ashwini Sharma: BI dashboard.
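
One way to picture the semantic layer's role in standardizing names, as a rough sketch: a lookup from business vocabulary to mart tables. The terms and table names in `SEMANTIC_NAMES` are hypothetical; a real semantic layer (for example, semantic views in the warehouse) would carry far more than names.

```python
# Hypothetical mapping from business vocabulary to mart tables.
SEMANTIC_NAMES = {
    "vendor": "dim_suppliers",
    "supplier": "dim_suppliers",
    "property listing": "dim_listings",
}


def resolve_entity(business_term: str) -> str:
    """Map a business-facing term to the mart table that defines it."""
    key = business_term.strip().lower()
    if key not in SEMANTIC_NAMES:
        raise KeyError(f"no semantic definition for {business_term!r}")
    return SEMANTIC_NAMES[key]


print(resolve_entity("Vendor"))  # dim_suppliers
```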

358 00:44:31.330 00:44:31.860 Uttam Kumaran: Okay.

359 00:44:33.380 00:44:36.920 Uttam Kumaran: And you would… you recommended that all of that lives in Snowflake?

360 00:44:39.750 00:44:56.799 Ashwini Sharma: Yes, so in this architecture, everything is living in the Snowflake, right? But if we are going through, like, iceberg, right, most of the tables would stay in the S3 itself, and maybe only the definitions will stay in the Snowflake and metadata.

361 00:44:59.620 00:45:00.130 Uttam Kumaran: Okay.

362 00:45:00.130 00:45:00.800 Ashwini Sharma: So…

363 00:45:03.190 00:45:03.720 Awaish Kumar: Okay.

364 00:45:03.720 00:45:08.029 Ashwini Sharma: But that’s terribly slow, I think, you know, when all the…

365 00:45:08.220 00:45:19.069 Ashwini Sharma: the Snowflake’s micro-partitioning strategy, that’s brilliant, at least. I mean, I don’t have the numbers, but if we query external tables versus

366 00:45:20.110 00:45:31.550 Ashwini Sharma: Querying internal tables in Databricks, it’s far faster, right? The querying of internal tables. And it’s the same case in Snowflake also, but I don’t have the numbers to, you know.

367 00:45:32.140 00:45:34.189 Ashwini Sharma: Give an approximate comparison.

368 00:45:35.740 00:45:36.300 Uttam Kumaran: Okay.

369 00:45:41.090 00:45:44.800 Uttam Kumaran: Okay. I guess my next question is gonna be sort of like,

370 00:45:45.100 00:45:54.290 Uttam Kumaran: about, like, naming conventions within each layer. Like, how do you think about naming models and schemas…

371 00:45:54.450 00:46:05.739 Uttam Kumaran: You know, to improve, like, the readability and accessibility of tables. Do you kind of have a sort of thought process in your mind? And then my follow-up question is going to be a little bit about, like.

372 00:46:06.240 00:46:11.400 Uttam Kumaran: the development environment itself. But maybe you could talk about the first… the first one.

373 00:46:12.530 00:46:13.650 Ashwini Sharma: Sure,

374 00:46:14.210 00:46:21.509 Ashwini Sharma: Yeah, so, these two layers will have, okay, in this one, right, let’s talk about how, we are…

375 00:46:25.890 00:46:33.300 Ashwini Sharma: So, in this layer, the naming convention would be… the raw layer is the name of the catalog, the database, right?

376 00:46:34.510 00:46:36.310 Ashwini Sharma: This is just my name.

377 00:46:37.540 00:46:42.779 Ashwini Sharma: Since I don’t have more context right now over here, but depending on the customer, this might be different.

378 00:46:42.900 00:46:51.320 Ashwini Sharma: And the data will be organized based on source, Followed by, table name.

379 00:46:54.880 00:46:55.950 Ashwini Sharma: Nevermind.

380 00:46:56.090 00:47:01.209 Ashwini Sharma: So, for example, let’s say, let’s say source is… is, NetSuite, right?

381 00:47:03.010 00:47:05.840 Ashwini Sharma: And table could be transactions table.

382 00:47:06.260 00:47:06.950 Ashwini Sharma: Right.

383 00:47:07.050 00:47:10.340 Ashwini Sharma: Or still, Salesforce.

384 00:47:10.940 00:47:12.190 Ashwini Sharma: Customers.

385 00:47:12.340 00:47:15.840 Ashwini Sharma: Stuff like that, right? And the same…

386 00:47:19.370 00:47:28.489 Ashwini Sharma: this is going to have the same structure. There won’t be any difference at all in what we have at the source, except for the view created, right, which.

387 00:47:30.100 00:47:36.049 Ashwini Sharma: Which might add some restrictions on the volume of data to pull surface, in this one.

388 00:47:36.220 00:47:38.599 Ashwini Sharma: In the staging layer,

389 00:47:39.220 00:47:44.169 Ashwini Sharma: So, like, okay, maybe I should talk about how I have modeled the data, right?

390 00:47:44.500 00:47:46.340 Ashwini Sharma: On the dbt side.

391 00:47:46.580 00:47:50.360 Ashwini Sharma: The way that I like to do it is,

392 00:48:04.160 00:48:06.950 Ashwini Sharma: So, at the top, we have models, right?

393 00:48:07.080 00:48:14.779 Ashwini Sharma: And within each model, I like to keep things separate, right? So, for example, let’s say I’m working on a model called procurement, right?

394 00:48:17.100 00:48:24.570 Ashwini Sharma: So… I will keep… models slash procurement, right? And in the same level, there are also going to be

395 00:48:25.230 00:48:26.740 Ashwini Sharma: procurement staging.

396 00:48:29.670 00:48:37.379 Ashwini Sharma: Right? And inside there, there will be, inside this one, there’ll be… facts and dimensions.

397 00:48:38.250 00:48:41.149 Ashwini Sharma: And this is just going to be SQL files, right?

398 00:48:45.950 00:48:52.049 Ashwini Sharma: So, actually, how do I display that?

399 00:48:58.750 00:49:03.009 Ashwini Sharma: Did you get what I’m trying to illustrate over here?

400 00:49:03.300 00:49:03.770 Awaish Kumar: Huh.

401 00:49:03.770 00:49:04.320 Uttam Kumaran: Yeah.

402 00:49:05.680 00:49:12.520 Ashwini Sharma: Okay, this is not the answer that you’re looking for, right? Can you give me a little bit more,

403 00:49:14.310 00:49:15.430 Ashwini Sharma: What do they call that?

404 00:49:15.430 00:49:21.919 Uttam Kumaran: I mean, I guess, like, yeah, I’m more under… trying to just understand, like, within a given layer.

405 00:49:21.950 00:49:41.479 Uttam Kumaran: Like, how, like, what are your naming kind of conventions, and, like, how you’re structuring it? So you kind of mentioned fact and DIM tables, you mentioned that there’s some staging, but do you typically, like, even if you think about the repo structure, like, do you have any thoughts on, like, how you organize your files and name things?

406 00:49:41.480 00:49:43.509 Ashwini Sharma: Yeah, so all the staging are…

407 00:49:44.010 00:49:46.679 Ashwini Sharma: Starts with, like, STG underscore something.

408 00:49:46.680 00:49:47.250 Uttam Kumaran: Okay.

409 00:49:47.920 00:49:50.510 Ashwini Sharma: Something, something, right?

410 00:49:50.850 00:50:03.149 Ashwini Sharma: Dimension starts with a DIM underscore something something, right? Facts will FCT underscore something something, right? There… there will be cases where…

411 00:50:03.280 00:50:14.789 Ashwini Sharma: business will ask for data that is neither a fact, neither a dimension, right? It’s more of a report kind of thing, right? In that case, it’ll be a report

412 00:50:14.940 00:50:17.770 Ashwini Sharma: Underscore something something, right?

413 00:50:17.990 00:50:22.399 Ashwini Sharma: And, there will be cases where, you know,

414 00:50:22.730 00:50:29.719 Ashwini Sharma: they are just interested in some basic numbers, right? In that case, it will be AGG underscore something something, which is…

415 00:50:29.860 00:50:37.169 Ashwini Sharma: Which is indicating that this is an aggregate data over some kind of a… some… some kind of underlying facts and dimensions.
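
The prefixes listed above (STG, DIM, FCT, report, AGG) could be enforced with a small check like the following sketch. The lowercase snake_case rule and the function name are assumptions; only the prefixes themselves come from the discussion.

```python
import re

# Prefixes named in the discussion; the lowercase snake_case rule is an assumption.
PREFIXES = ("stg", "dim", "fct", "report", "agg")
NAME_RE = re.compile(r"^(?:%s)_[a-z0-9]+(?:_[a-z0-9]+)*$" % "|".join(PREFIXES))


def is_valid_model_name(name: str) -> bool:
    """True when the model name starts with a known prefix plus a snake_case body."""
    return NAME_RE.match(name) is not None


for candidate in ("dim_users", "fct_orders_daily", "users", "stg_"):
    print(candidate, is_valid_model_name(candidate))
```

A check like this could run in CI over the model file names so that naming drift is caught before review.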

416 00:50:39.090 00:50:40.149 Uttam Kumaran: Okay, okay.

417 00:50:40.470 00:50:47.280 Uttam Kumaran: I guess my question, is also, if we could talk about,

418 00:50:47.490 00:50:50.230 Uttam Kumaran: You know, something that happens on a lot of our clients, where

419 00:50:50.600 00:50:52.850 Uttam Kumaran: Let’s take an example of…

420 00:50:53.090 00:50:55.720 Uttam Kumaran: Hey, like, this data is missing.

421 00:50:56.270 00:51:00.419 Uttam Kumaran: You know, can you walk us through, like, how you kind of take

422 00:51:01.030 00:51:15.990 Uttam Kumaran: like, a client just saying, hey, my data’s missing, or this number doesn’t look right. Talk us through, like, the diagnosis process, like, that you would typically do. Let’s say, hey, someone messaged me to say, hey, DIM users looks off.

423 00:51:16.250 00:51:16.640 Ashwini Sharma: The…

424 00:51:16.640 00:51:23.500 Uttam Kumaran: revenue column, or, like, in DIM users, it’s, like, the country column, doesn’t look right.

425 00:51:23.630 00:51:32.610 Uttam Kumaran: talk to me about, like, how you kind of, like, wrangle with that, and I would say, maybe if we can roleplay a bit, think of me as, like, the customer, and I just Slack you.

426 00:51:32.850 00:51:35.469 Uttam Kumaran: Dim users off, a country field is wrong.

427 00:51:36.110 00:51:38.990 Uttam Kumaran: Walk me through, like, kind of, like, what you do or how you handle that.

428 00:51:38.990 00:51:56.820 Ashwini Sharma: Yeah, yeah, so basically, you look into the dim table itself, right, see what value you are seeing in that, and then you go back to the source table, and then see what values you are seeing over there, and maybe you can, you know, quickly churn out the data, like, instead of going through all the layers and then creating that dimension, you can directly look into that dimension, sorry.

429 00:51:56.820 00:52:01.060 Ashwini Sharma: The source table that contains the country information, and then figure out, you know.

430 00:52:01.060 00:52:02.929 Ashwini Sharma: What exactly the data should be.

431 00:52:02.930 00:52:16.099 Ashwini Sharma: And see… now, if you see an issue, right, that, you know, in the source you see something else, whereas in the DIM you are seeing something else, then probably there is something got messed up in the transformation between

432 00:52:16.100 00:52:28.110 Ashwini Sharma: source and the dimension table. You start looking into the queries, or maybe, like, the best way to do it is you click on the dimension table, look into the lineage that… that, you know.

433 00:52:28.110 00:52:38.839 Ashwini Sharma: That’s shown in the dbt diagrams, right? And then start exploring each model that has been used to create this dimension table, and then figure out, like, where things could have gone wrong.
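
The lineage-walking step described above can be sketched as a breadth-first traversal from the suspect table back toward its sources. The `LINEAGE` graph and its model names are made up for illustration; in practice the graph would come from the dbt lineage view mentioned in the answer.

```python
from collections import deque

# Hypothetical lineage: model -> direct upstream parents.
LINEAGE = {
    "dim_users": ["stg_users", "stg_countries"],
    "stg_users": ["base_salesforce_customers"],
    "stg_countries": ["base_salesforce_countries"],
    "base_salesforce_customers": [],
    "base_salesforce_countries": [],
}


def upstream_models(model: str, lineage: dict[str, list[str]]) -> list[str]:
    """Breadth-first list of everything upstream of `model`, nearest first."""
    seen, order, queue = {model}, [], deque(lineage.get(model, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        order.append(current)
        queue.extend(lineage.get(current, []))
    return order


# Models to inspect, in order, when dim_users.country looks wrong.
print(upstream_models("dim_users", LINEAGE))
```

Checking the nearest models first matches the approach in the answer: compare the dimension with the source, then narrow down which transformation in between changed the value.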

434 00:52:39.540 00:52:45.449 Uttam Kumaran: So then, let’s say the client… you… I even… like, think about even this process of, like.

435 00:52:45.630 00:52:48.280 Uttam Kumaran: even less technical, right? The client is, like.

436 00:52:48.890 00:52:56.969 Uttam Kumaran: hey, I sent a message that DIM users was wrong. When is this gonna get fixed? Talk me through how you think about even providing an estimate.

437 00:52:57.200 00:53:11.499 Uttam Kumaran: Right? About, like, something like that. Because this is something that happens all the time, where our team gets hit with, hey, where is this thing? And one thing I tell our team is it’s not… at that moment, it’s not fair to the customer to not reply.

438 00:53:11.590 00:53:22.669 Uttam Kumaran: And just go and figure it out, and after 4 days, come back. You have to say something, so walk me through, like, how you… how are you… how do you typically deal with those types of, like, yeah, things with… with customers?

439 00:53:22.670 00:53:34.610 Ashwini Sharma: So, normally, you know, I prefer to get into a call, but if that’s not feasible at that time, I’m just going to acknowledge that, okay, you know, I have got your message, I’m looking into it.

440 00:53:34.680 00:53:45.750 Ashwini Sharma: If you could spend a few minutes, and then, you know, give me… give me some more details about what… what… what do you think is wrong about that dimension, right? Or… or maybe if they’re looking into some kind of a report.

441 00:53:45.750 00:53:58.910 Ashwini Sharma: what do you think is wrong? What do you think the right value looks like, right? And this has happened a lot with one of the businesses that I was working with, you know, and they were saying that, you know, this number is wrong, and then

442 00:53:59.050 00:54:11.579 Ashwini Sharma: they would not even say what is the correct number, right? So, basically, somehow it was wrong, and it took multiple iterations to get answers from them on what exactly was wrong, because

443 00:54:11.670 00:54:24.379 Ashwini Sharma: the query was working as expected, right? But they were still saying it was wrong, and what I was doing is, I was doing… I was implementing the logic that they have said, exactly, all the logic that they have given me.

444 00:54:24.660 00:54:42.250 Ashwini Sharma: exact logic was implemented in the transformation. Nothing wrong with the transformation, right? And still, they were saying, no, something is off with this number, something’s off with the number, right? And it took multiple rounds, several weeks of discussion to figure out that the initial logic that they had shared, that was wrong, right?

445 00:54:42.250 00:54:56.839 Ashwini Sharma: So when that changed, everything looked… started looking good. So yeah, I mean, it’s more a frequent communication, trying to understand what they feel is wrong, what they think is wrong, not just feel, and, you know, derive context from that.

446 00:54:58.460 00:55:05.149 Uttam Kumaran: Another question I had is, like, maybe you can zoom out a little bit. Let’s say you have an ex… you have a situation where a client says.

447 00:55:05.530 00:55:08.670 Uttam Kumaran: Yeah, I… this is all great, but you have one week.

448 00:55:09.270 00:55:09.860 Ashwini Sharma: Yeah.

449 00:55:10.660 00:55:14.159 Uttam Kumaran: And I need to get something in a dashboard in one week.

450 00:55:14.300 00:55:18.119 Uttam Kumaran: and you’re starting from scratch. So talk to me about what parts of this

451 00:55:18.300 00:55:21.380 Uttam Kumaran: Of your ideal situation that you would…

452 00:55:21.600 00:55:25.119 Uttam Kumaran: decide not to do, right? Or what would you sacrifice?

453 00:55:27.280 00:55:33.069 Ashwini Sharma: Yeah, low privacy. Oh, okay. What would I sacrifice in the architecture? Okay. Yeah.

454 00:55:33.070 00:55:37.379 Uttam Kumaran: Yeah, no, no, of course, like, yeah, don’t do… they’ll say they give you a priority, they said.

455 00:55:38.010 00:55:44.150 Uttam Kumaran: They… let’s say they tell you, okay, these sources, blah blah blah, but what part of your architecture do you decide to sacrifice, and why?

456 00:55:44.580 00:55:51.599 Ashwini Sharma: Mmm… Just one week, oh, man, that’s,

457 00:55:54.150 00:55:58.589 Ashwini Sharma: Okay, we need to give them something very soon, right? Yeah?

458 00:56:03.260 00:56:08.639 Ashwini Sharma: So, in this case, what I can do is, you know…

459 00:56:09.120 00:56:13.789 Ashwini Sharma: like, I’m making an assumption that we already have ingested the data, right?

460 00:56:15.340 00:56:19.910 Uttam Kumaran: That’s what… you’ll have… you’ll be doing that too, but that… you’re right, there’s, like, not much…

461 00:56:20.300 00:56:20.620 Ashwini Sharma: Yeah.

462 00:56:20.620 00:56:21.910 Uttam Kumaran: creativity there.

463 00:56:22.140 00:56:25.750 Uttam Kumaran: I guess I’m more interested in… in the…

464 00:56:25.920 00:56:31.069 Uttam Kumaran: in this area, right? Like, what parts of this are you deciding not to do?

465 00:56:31.660 00:56:33.000 Uttam Kumaran: Or removing, or removing.

466 00:56:33.000 00:56:46.279 Ashwini Sharma: For this case, what I would do is I would create a temporary layer above the raw layer, or above the base layer, basically, right? And create views on that, which sort of, you know.

467 00:56:47.350 00:57:05.529 Ashwini Sharma: get the data that the customer is looking for directly, right, without going through extensive modeling, creating dimensions and facts. I’ll just create views, which essentially is kind of looking into the raw data and displaying things that the customer is interested in, directly calculating those on the fly.

468 00:57:05.590 00:57:07.970 Ashwini Sharma: And then expose them into the dashboard.

469 00:57:08.750 00:57:09.320 Uttam Kumaran: Okay.

470 00:57:10.350 00:57:14.319 Uttam Kumaran: I guess, what parts… what parts of the architecture do you think you can’t sacrifice?

471 00:57:15.150 00:57:16.910 Ashwini Sharma: what I can’t sacrifice?

472 00:57:17.360 00:57:17.980 Uttam Kumaran: Yeah.

473 00:57:21.240 00:57:30.410 Ashwini Sharma: Well, I can’t sacrifice this part of the pipeline, right, which is basically ingesting data up till here. I cannot sacrifice,

474 00:57:31.180 00:57:34.369 Ashwini Sharma: I cannot sacrifice this layer, base layer, right?

475 00:57:34.490 00:57:41.690 Ashwini Sharma: Because I have seen cases where, you know, few queries have run into lots of dollars, and…

476 00:57:42.290 00:57:49.840 Ashwini Sharma: Yeah. Kind of, this part I can’t sacrifice. I cannot sacrifice,

477 00:57:50.850 00:57:59.030 Ashwini Sharma: some form of data governance, right? Which… which I want it to be there so that somebody does not mess up things in the raw layer.

478 00:57:59.360 00:58:01.339 Ashwini Sharma: Which is still possible, right?

479 00:58:02.840 00:58:09.949 Ashwini Sharma: So… Yeah, these three parts can be sacrificed, I don’t want to sacrifice these two parts.

480 00:58:10.590 00:58:12.949 Ashwini Sharma: Okay. Raw layer and the base layer.

481 00:58:14.160 00:58:14.680 Uttam Kumaran: Okay.

482 00:58:15.830 00:58:24.039 Uttam Kumaran: Yeah, maybe one last question for me sort of scenario is, like, let’s say you have a dbt job that’s taking a long time to run.

483 00:58:24.470 00:58:27.499 Uttam Kumaran: Right? Let’s say you’re running 25, 30 models.

484 00:58:27.720 00:58:34.889 Uttam Kumaran: Like, walk me through, like… your diagnosis, And how you identify

485 00:58:35.110 00:58:37.279 Uttam Kumaran: What the next steps are in order to…

486 00:58:38.190 00:58:47.089 Uttam Kumaran: you know, like, kind of, like, what your criteria is, and, like, how you actually go on to solve, you know, a problem, with long-running dbt jobs.

487 00:58:47.600 00:58:58.920 Ashwini Sharma: Yeah, there could be multiple reasons why it’s running slow, right? So, maybe, like, I’ll just bucket some of these reasons, right? One of the reasons is

488 00:58:59.460 00:59:06.219 Ashwini Sharma: It is slow because, the way data is organized in the raw layer.

489 00:59:06.330 00:59:16.069 Ashwini Sharma: that is wrong. It should be well partitioned, right? And maybe it’s just, you know, just dumped data, and because of which.

490 00:59:16.270 00:59:22.889 Ashwini Sharma: it’s taking a lot of time to read. That is one of the reasons, so let’s put that into one bucket. The other bucket is…

491 00:59:23.030 00:59:29.399 Ashwini Sharma: maybe the queries that are written in the dbt jobs are not optimal, they are not well planned.

492 00:59:30.130 00:59:36.509 Ashwini Sharma: that could be another reason. The other reason could be, like, since there are lots of dbt models, right.

493 00:59:36.690 00:59:41.509 Ashwini Sharma: It is possible that you are repeating certain transformations again and again.

494 00:59:42.210 00:59:47.750 Ashwini Sharma: so… Maybe there is some scope of reducing,

495 00:59:48.390 00:59:55.720 Ashwini Sharma: the number of transformations that you are doing, because you’ll be creating multiple CTEs, right, within each dbt model, and…

496 00:59:55.860 01:00:07.720 Ashwini Sharma: like, when the number of models are large, I’m pretty sure there is a good chance that, you know, there is this transformation repetition happening in the entire codebase.

497 01:00:07.870 01:00:10.200 Ashwini Sharma: If… so…

498 01:00:10.740 01:00:17.809 Ashwini Sharma: If the data is not partitioned well, then we’ll have to look into how to partition the data properly so that the queries are more efficient.

499 01:00:17.930 01:00:25.699 Ashwini Sharma: If the queries are not written correctly, right, maybe bad joins, maybe bad selects, right?

500 01:00:26.080 01:00:30.919 Ashwini Sharma: In that case, you’ll have to look in the queries and address each query wherever,

501 01:00:31.280 01:00:38.949 Ashwini Sharma: the query is not planned properly, look into the plan of the execution plan of the query, and then see how it is executing, right?

502 01:00:40.100 01:00:45.279 Ashwini Sharma: So, yeah, that’s the other thing, and what was the last that I said?

503 01:00:45.500 01:01:02.410 Ashwini Sharma: Yeah, repetitions, avoid that, wherever possible, right? Select the least number of columns that is needed to create the transformation models, avoid select stars, right? Avoid select distincts, right? Those are the things that I would look into.
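
Two of the checks mentioned above (avoiding select stars and select distincts) can be sketched as a naive text scan over model SQL. The pattern names and the `flag_patterns` helper are hypothetical, and this is plain regex matching, not a SQL parser, so forms like `a.*` are not caught.

```python
import re

# Naive patterns for the two habits called out in the answer.
CHECKS = {
    "select_star": re.compile(r"\bselect\s+\*", re.IGNORECASE),
    "select_distinct": re.compile(r"\bselect\s+distinct\b", re.IGNORECASE),
}


def flag_patterns(sql: str) -> list[str]:
    """Return the names of the checks that the given SQL text trips."""
    return sorted(name for name, pattern in CHECKS.items() if pattern.search(sql))


print(flag_patterns("select * from base_orders"))           # ['select_star']
print(flag_patterns("SELECT DISTINCT id FROM base_orders"))  # ['select_distinct']
print(flag_patterns("select id, amount from base_orders"))   # []
```

A scan like this only points at candidates; the query plan still has to be read to confirm where the time goes.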

504 01:01:04.310 01:01:04.840 Uttam Kumaran: Cool.

505 01:01:06.470 01:01:13.150 Uttam Kumaran: Okay, yeah, so, I mean, to give you a sense, like, these are, like, of course, all the classic issues that we’re dealing with every day.

506 01:01:13.470 01:01:27.150 Uttam Kumaran: You know, we’re walking into a situation where there’s jobs that are messed up, client messages, hey, this report doesn’t work, or it’s not loading. And so a lot of our job is actually not about solving the problem, but explaining

507 01:01:27.390 01:01:37.200 Uttam Kumaran: Right? And so, the reason why we do this exercise is, if you get in on with a client who’s not particularly happy, how do you break down

508 01:01:37.260 01:01:48.180 Uttam Kumaran: what it is we’re doing and why that there’s complexity, right? There’s a lot of people, I think, that can go in and solve it, but that is actually only 50% of, like, what we do.

509 01:01:48.400 01:01:51.629 Uttam Kumaran: It’s different than working internally in a company, because…

510 01:01:51.850 01:02:04.539 Uttam Kumaran: client… and when you’re internal to a company, you can’t decide who your data team is. But the client can decide that we’re no longer their data team, right? And so 50% of our job is communication.

511 01:02:05.710 01:02:14.359 Uttam Kumaran: And so it’s… it means… it makes a huge difference that when a client… if a client asks you, like, hey, this job is running long, like, how would you fix it?

512 01:02:14.360 01:02:16.119 Uttam Kumaran: It’s not… the answer is not…

513 01:02:16.120 01:02:35.549 Uttam Kumaran: no worry, I’ll fix it. It’s like, yeah, here’s exactly the step-by-step way, that we’ll do things, and you build empathy with them, right? On, like, really why this is complicated. A lot of our clients have never done this before. They don’t have an appreciation for how difficult this is, and this is the reason why I asked about sacrifices. They don’t… they’re not aware

514 01:02:35.590 01:02:41.169 Uttam Kumaran: Of the trade-offs, right? But a bad engineer will make trade-offs without ever explaining.

515 01:02:41.370 01:02:46.190 Uttam Kumaran: Right. And so for us, it’s up to us to not only teach, but then also explain.

516 01:02:46.260 01:03:02.129 Uttam Kumaran: here are the two choices you have. We could run really fast and not be able to do this, but here’s how it may come back to bite you. And we’re always wrangling with these choices, but it’s… the reason why I explain that is, like, we’re not always in, like, a perfect engineering situation where

517 01:03:02.230 01:03:18.070 Uttam Kumaran: we have, like, years and years and tons of resources. For the most part, we are making compromises every day in our architecture in order to drive value today, but that is a decision, like, we make as that team, and we make with the client, right? We don’t make

518 01:03:18.360 01:03:21.609 Uttam Kumaran: Sort of just in our heads and running, so, yeah.

519 01:03:24.180 01:03:24.950 Ashwini Sharma: Yep.

520 01:03:27.040 01:03:31.080 Uttam Kumaran: Cool. Any other… any questions for us?

521 01:03:31.870 01:03:32.660 Uttam Kumaran: Ashwini?

522 01:03:32.660 01:03:38.619 Ashwini Sharma: Now, what is the name of the tool you just said for ingestion that you’re using? You mentioned it once.

523 01:03:38.620 01:03:39.730 Uttam Kumaran: Polytomic.

524 01:03:40.120 01:03:41.349 Ashwini Sharma: Polytomic, yeah.

525 01:03:42.040 01:03:49.279 Uttam Kumaran: Yeah, Polytomic. Yeah, they’re, like, very, very, not well-known. They don’t do a lot of marketing, but…

526 01:03:49.840 01:03:51.829 Uttam Kumaran: Really good, and good, good team.

527 01:03:52.050 01:03:57.040 Ashwini Sharma: Yeah, you told me last time, and then I didn’t write it down, and then it went off my head by the time.

528 01:03:57.040 01:03:57.490 Uttam Kumaran: Yeah, yeah.

529 01:03:57.490 01:03:58.610 Ashwini Sharma: with the interviewer.

530 01:03:59.730 01:04:02.170 Uttam Kumaran: Yeah, polytomic.com, yeah.

531 01:04:04.820 01:04:07.789 Uttam Kumaran: Cool. Okay, any other questions, guys?

532 01:04:09.520 01:04:10.680 Awaish Kumar: Nope, not from my side.

533 01:04:12.350 01:04:13.170 Uttam Kumaran: Okay.

534 01:04:13.880 01:04:14.450 Ashwini Sharma: Alright.

535 01:04:14.450 01:04:20.379 Uttam Kumaran: Anything else that we need from the team, or anything? And then, if not, then I think we’ll probably be in touch today.

536 01:04:20.710 01:04:22.760 Ashwini Sharma: No, I don’t have anything, no.

537 01:04:23.000 01:04:23.600 Uttam Kumaran: Okay.

538 01:04:23.760 01:04:38.040 Uttam Kumaran: All right, thank you. I know I appreciate you going… I appreciate you going through this. I think part of the reason that we’re starting to do more of diagramming and these types of interviews is just because this is, like, what a lot of our clients are… are… need to see.

539 01:04:38.040 01:04:42.760 Uttam Kumaran: And, like, this is how we communicate with them about our technical depth, because what do they say, if you can…

540 01:04:42.880 01:04:48.300 Uttam Kumaran: The only… when you know you mastered something is when you can teach, and when you can diagram, right?

541 01:04:48.300 01:04:49.090 Ashwini Sharma: Right, yeah.

542 01:04:49.090 01:04:56.180 Uttam Kumaran: So, that’s ultimately… a lot of our customers are so interested in how we think about these things, it’s not just enough that we can go do it.

543 01:04:56.470 01:05:02.759 Uttam Kumaran: it’s like, we can actually teach and diagram and explain, so I appreciate the time today and for doing this.

544 01:05:02.760 01:05:04.660 Ashwini Sharma: Sure, yeah, thanks for talking to me.

545 01:05:04.940 01:05:06.640 Ashwini Sharma: Talking with me.

546 01:05:06.920 01:05:08.530 Uttam Kumaran: Oh, of course. Yeah.

547 01:05:08.530 01:05:09.300 Uttam Kumaran: Okay.

548 01:05:09.300 01:05:09.760 Ashwini Sharma: Okay.

549 01:05:09.760 01:05:11.450 Uttam Kumaran: Thank you. Talk to you soon.

550 01:05:11.450 01:05:12.840 Ashwini Sharma: Have a nice evening, bye.