Meeting Title: Data Engineer Interview (Abhijith Thakur) Date: 2025-08-08 Meeting participants: Abhijith Thakur, Awaish Kumar


WEBVTT

1 00:01:51.400 00:01:52.200 Awaish Kumar: Hi!

2 00:01:56.590 00:01:57.500 Awaish Kumar: Hello!

3 00:01:59.390 00:02:01.419 Abhijith Thakur: Hi, Awaish! Good morning!

4 00:02:02.900 00:02:03.810 Awaish Kumar: Good morning!

5 00:02:03.970 00:02:05.119 Abhijith Thakur: Can you see me?

6 00:02:05.840 00:02:07.180 Awaish Kumar: Yes, yes, thank you.

7 00:02:08.009 00:02:10.099 Abhijith Thakur: But yeah, hi.

8 00:02:11.330 00:02:12.909 Awaish Kumar: Hi! How are you?

9 00:02:12.910 00:02:14.489 Abhijith Thakur: I’m doing good. How are you?

10 00:02:15.820 00:02:20.809 Awaish Kumar: I'm good as well. So, yeah, last time we

11 00:02:21.260 00:02:24.059 Awaish Kumar: I don't know, it was in the middle of the...

12 00:02:24.440 00:02:32.179 Abhijith Thakur: Yes, actually, there was a power outage at my place, so I did not get a chance to get back to you,

13 00:02:33.610 00:02:38.700 Abhijith Thakur: like, the weather was bad, and then there was a complete outage.

14 00:02:41.152 00:02:44.230 Awaish Kumar: So yeah, we could start today.

15 00:02:44.510 00:02:48.600 Abhijith Thakur: I just have a request: I can't hear you properly.

16 00:02:50.310 00:02:51.819 Awaish Kumar: You can’t hear me properly.

17 00:02:51.820 00:02:52.710 Abhijith Thakur: Yeah.

18 00:02:53.340 00:02:54.859 Awaish Kumar: Is it better now?

19 00:02:55.340 00:02:58.030 Abhijith Thakur: No, I mean I can barely hear you.

20 00:03:01.530 00:03:06.220 Awaish Kumar: Okay, actually, I use my

21 00:03:08.140 00:03:10.350 Awaish Kumar: get forward. But I don’t know.

22 00:03:13.790 00:03:16.510 Awaish Kumar: Okay, give me a moment. I’ll get back.

23 00:04:24.101 00:04:25.960 Awaish Kumar: Can you call me now?

24 00:04:29.340 00:04:36.650 Abhijith Thakur: Yes, it's fine. Yes, I can hear you. But yeah.

25 00:04:38.820 00:04:43.510 Awaish Kumar: Like, I think maybe if you have a headset, it could be better.

26 00:04:43.510 00:04:45.649 Abhijith Thakur: I still can't hear. I'm sorry, what?

27 00:04:47.290 00:04:51.420 Awaish Kumar: Yeah, I'm saying that if you have a headset,

28 00:04:51.620 00:04:53.700 Awaish Kumar: that could give you better audio.

29 00:05:14.480 00:05:25.030 Awaish Kumar: Yeah, I'm saying that, in that case, if it's not working, then you might need a headset, because I tried with and without my headset, and

30 00:05:25.320 00:05:27.820 Awaish Kumar: it's still not clear for you.

31 00:05:28.650 00:05:34.170 Abhijith Thakur: I mean, that's fine. Yeah, we can go ahead. I mean, I can't hear you properly, but, you know, it is

32 00:05:34.460 00:05:37.790 Abhijith Thakur: not so clear. But that's fine, we can go ahead.

33 00:05:38.640 00:05:45.599 Awaish Kumar: Okay. Because, yeah, I have tried from my side, but maybe if you have one and check, it would be better.

34 00:05:45.850 00:05:48.680 Abhijith Thakur: I don’t. I mean, that’s fine.

35 00:05:49.900 00:05:54.950 Awaish Kumar: Okay. Clock phone.

36 00:05:55.420 00:05:56.800 Awaish Kumar: Yeah. What?

37 00:05:57.890 00:06:02.940 Awaish Kumar: Yeah. Okay, sorry. We can start, if... I don't know, like,

38 00:06:04.710 00:06:09.030 Awaish Kumar: from last time, we were discussing some of your projects, right?

39 00:06:09.410 00:06:10.000 Abhijith Thakur: Yeah.

40 00:06:11.290 00:06:19.419 Awaish Kumar: So, to start with, maybe if you can briefly describe one of the projects

41 00:06:19.730 00:06:24.579 Awaish Kumar: you've worked on, and its tech stack and things like that, so we could just start from there.

42 00:06:26.228 00:06:29.100 Abhijith Thakur: Talking about one of my projects?

43 00:06:30.100 00:06:52.389 Awaish Kumar: Yeah, give me an example of a project which you think best fits Brainforce's needs: for example, a complex data pipeline project, or a project you are most proud of, where you built something, built a data pipeline, optimized a data pipeline, or built the modeling for a data warehouse.

44 00:06:52.600 00:06:53.610 Awaish Kumar: Anything.

45 00:06:54.540 00:06:55.500 Abhijith Thakur: Yeah.

46 00:06:55.630 00:07:25.509 Abhijith Thakur: sure. So basically, there was a project where I was working on a complete data pipeline, during my academic work, with Databricks. So basically, the goal was to process a large volume of sales and customer data to generate actionable insights and predictive models. So I started by ingesting raw CSV data into

47 00:07:25.600 00:07:26.969 Abhijith Thakur: AWS S3,

48 00:07:27.170 00:07:34.909 Abhijith Thakur: and then mounted it in Databricks, and I used PySpark to clean, transform, and join multiple...

49 00:07:34.910 00:07:39.390 Awaish Kumar: Ingested? How did you ingest the data, to process it?

50 00:07:40.350 00:07:42.549 Abhijith Thakur: How did I ingest my data?

51 00:07:43.830 00:07:48.069 Awaish Kumar: How did you ingest the CSV files to AWS S3?

52 00:07:49.220 00:08:00.765 Abhijith Thakur: Okay. So, I mean, in order to ingest the data into AWS S3, basically, in Databricks I've used dbutils,

53 00:08:01.270 00:08:18.369 Abhijith Thakur: dbutils commands, to copy the files from local or some external sources into AWS. And then, in the Databricks workspace, using S3... so basically, I used the function to mount S3 as a directory in Databricks,

54 00:08:18.840 00:08:24.170 Abhijith Thakur: and then I uploaded the raw CSV files, and... I'm sorry?
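
A minimal sketch of the mount-and-copy flow described above, assuming a hypothetical bucket name, secret scope, and paths; the dbutils.fs calls are standard Databricks utilities, everything else is illustrative:

```python
# Hedged sketch: mount an S3 bucket in a Databricks workspace, then copy a
# staged CSV into it. Bucket name, secret scope, and paths are assumptions.
import urllib.parse

access_key = dbutils.secrets.get(scope="aws", key="access_key")
secret_key = urllib.parse.quote(dbutils.secrets.get(scope="aws", key="secret_key"), safe="")

# Mount the bucket so it appears as a DBFS directory.
dbutils.fs.mount(
    source=f"s3a://{access_key}:{secret_key}@sales-raw-bucket",
    mount_point="/mnt/sales-raw",
)

# Copy a locally staged raw CSV into the mounted bucket.
dbutils.fs.cp("file:/tmp/sales_2025_08.csv", "/mnt/sales-raw/landing/sales_2025_08.csv")
```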

55 00:08:25.380 00:08:27.709 Awaish Kumar: Yeah. Then, from S3,

56 00:08:28.080 00:08:31.400 Awaish Kumar: the data goes to which database?

57 00:08:33.940 00:08:36.319 Abhijith Thakur: From S3 to which database?

58 00:08:37.799 00:08:52.789 Abhijith Thakur: Yes. So the data was actually going into Delta Lake format, and queried directly in Databricks, without loading it into a separate database. So in this project...

59 00:08:53.350 00:09:00.830 Awaish Kumar: Yeah, you mentioned that when data comes to S3, then you mount it, and then use it. I'm asking, like,

60 00:09:01.110 00:09:05.070 Awaish Kumar: once it is in S3, then how do you use it for the...

61 00:09:06.880 00:09:10.459 Abhijith Thakur: Once it is in S3, how do I use it further?

62 00:09:10.990 00:09:19.889 Abhijith Thakur: So what I'm basically saying is, once the data is in S3, then I'll be using Delta Lake,

63 00:09:19.890 00:09:43.909 Abhijith Thakur: Delta Lake procedures, where I read the raw data using Spark APIs, reading the files directly from S3, and then I perform all the required pre-processing steps using PySpark, where I handle typecasting, aggregations, and joins, apply business logic, and then write the data into the data lake.

64 00:09:43.910 00:09:52.749 Abhijith Thakur: And once the data has been cleaned, the data is saved back to S3 in Delta format,

65 00:09:52.990 00:09:58.160 Abhijith Thakur: which enables ACID transactions and efficient...
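
As a rough illustration of the processing step he describes (read raw CSVs from S3, typecast, join, aggregate, write Delta), with invented paths, schemas, and business logic:

```python
# Hedged sketch of the PySpark pre-processing described above. Buckets,
# column names, and the aggregation are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

sales = spark.read.option("header", True).csv("s3a://sales-raw-bucket/landing/sales/")
customers = spark.read.option("header", True).csv("s3a://sales-raw-bucket/landing/customers/")

curated = (
    sales
    .withColumn("amount", F.col("amount").cast("double"))   # typecasting
    .join(customers, on="customer_id", how="left")          # join
    .groupBy("customer_id", "region")
    .agg(F.sum("amount").alias("total_sales"))              # aggregation
)

# Save the cleaned result back to S3 in Delta format.
curated.write.format("delta").mode("overwrite").save("s3a://sales-curated-bucket/sales_agg/")
```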

66 00:09:59.680 00:10:02.570 Awaish Kumar: So you mentioned using a data lake there.

67 00:10:03.430 00:10:04.170 Abhijith Thakur: So...

68 00:10:04.680 00:10:05.829 Awaish Kumar: Which data lake?

69 00:10:08.500 00:10:09.960 Abhijith Thakur: Data lake?

70 00:10:11.190 00:10:18.157 Abhijith Thakur: Yes, I mean, I was saying, I was actually working with

71 00:10:18.750 00:10:25.829 Abhijith Thakur: Delta Lake, as in the Delta Lake format, which is built on AWS S3.

72 00:10:26.340 00:10:44.290 Abhijith Thakur: And I implemented the data lake architecture using Delta Lake on Amazon S3, where Amazon S3 was the storage layer and Delta Lake acted as the data lake engine, which is an open-source storage layer.
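
For completeness, a short sketch of querying that Delta table, including the time travel that the Delta transaction log enables; the path is the assumed one from the sketch above:

```python
# Read the current version of the Delta table, and an earlier snapshot.
current = spark.read.format("delta").load("s3a://sales-curated-bucket/sales_agg/")
snapshot_v0 = (
    spark.read.format("delta")
    .option("versionAsOf", 0)  # Delta time travel by table version
    .load("s3a://sales-curated-bucket/sales_agg/")
)
```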

73 00:10:46.260 00:10:48.240 Awaish Kumar: Okay, join in.

74 00:10:50.910 00:10:57.800 Awaish Kumar: Okay. And, like, what is the data volume you have been working with?

75 00:10:59.460 00:11:01.670 Abhijith Thakur: The data volume I have been working with?

76 00:11:02.530 00:11:09.940 Abhijith Thakur: Yeah, I mean, with respect to the data volume

77 00:11:10.540 00:11:39.180 Abhijith Thakur: which I was working with: it was actually high-volume data, where usually we were getting the data from multiple sources. So, basically, around 20 million, I could say, up to tens of millions of records, depending on the use case, especially in healthcare; and the other domain was telecom, where my

78 00:11:40.140 00:11:57.079 Abhijith Thakur: anomaly detection pipeline worked on 40 to 50 million records; whereas in the case of TCS, while I was working there, I cleaned and transformed data for over 20 million subscribers.

79 00:11:58.830 00:11:59.460 Awaish Kumar: Hello.

80 00:12:02.640 00:12:07.149 Awaish Kumar: 20 million plus subscribers. So, how many...

81 00:12:07.320 00:12:11.250 Awaish Kumar: like, this data must be coming in real time?

82 00:12:13.320 00:12:14.780 Abhijith Thakur: I didn’t get you sorry.

83 00:12:16.420 00:12:18.510 Awaish Kumar: 20 million subscriber, often.

84 00:12:18.510 00:12:18.890 Abhijith Thakur: Yes.

85 00:12:18.890 00:12:19.760 Awaish Kumar: Location, Right.

86 00:12:21.410 00:12:27.300 Awaish Kumar: So, were you collecting the data in real time, or in batches, or how?

87 00:12:29.960 00:12:48.259 Abhijith Thakur: The data was being collected... I mean, whether it's on a real-time basis varies from project to project. But talking about the TCS project, it was not real time; it was batch mode, batch processing,

88 00:12:49.150 00:12:54.490 Abhijith Thakur: like, typically we'd do it on a weekly schedule.

89 00:12:56.170 00:13:00.640 Awaish Kumar: So your CSV files, where are they coming from?

90 00:13:01.470 00:13:02.870 Abhijith Thakur: The CSV files?

91 00:13:04.140 00:13:08.679 Awaish Kumar: The CSV files which you are ingesting to AWS S3.

92 00:13:08.810 00:13:12.900 Awaish Kumar: Where are they coming from? Like, what is the source of your data?

93 00:13:13.500 00:13:34.519 Abhijith Thakur: So the CSV files were sourced from internal data: either from an enterprise database, like Oracle, or provided by the clients or stakeholders, to use as part of batch ingestion workflows. And talking about Cigna, which is my recent client,

94 00:13:34.640 00:13:41.349 Abhijith Thakur: the CSV files were generated on a schedule from the internal Oracle database.

95 00:13:41.350 00:13:47.329 Awaish Kumar: I don't want you to jump across clients. We want to stick to one single project,

96 00:13:47.640 00:13:48.259 Abhijith Thakur: Okay, okay.

97 00:13:48.260 00:13:54.339 Awaish Kumar: the one you are talking about, as the one you're most proud of. You must have done some good work on that.

98 00:13:54.550 00:14:05.230 Awaish Kumar: My point is that you mentioned that data is ingested through CSV files to S3, and you are choosing a data lake architecture to further process it, and

99 00:14:05.350 00:14:09.329 Awaish Kumar: then store it again into some storage. Right?

100 00:14:10.010 00:14:21.160 Awaish Kumar: My point is: the data which is going to S3, what is that data, where is it coming from? Even if it is a CSV file, like, what is the...

101 00:14:21.570 00:14:26.199 Awaish Kumar: the CSV files, where are they being populated from? What kind of data is it? How...

102 00:14:26.200 00:14:26.640 Abhijith Thakur: She is.

103 00:14:26.640 00:14:27.230 Awaish Kumar: Clock.

104 00:14:27.670 00:14:31.249 Awaish Kumar: How many files are there? Like, the scale of that?

105 00:14:31.940 00:14:48.920 Abhijith Thakur: Okay. So basically, the CSV files were generated as part of scheduled exports from the internal Oracle databases. This data was generated from the internal systems, which include

106 00:14:48.920 00:15:13.100 Abhijith Thakur: the processing systems, the provider and member data platforms, and the billing and financial systems. So basically, the DBA teams, who were scheduling the ETL jobs, would extract daily or weekly snapshots of the specified tables or reports from these internal systems, and those data were exported into the CSV files,

107 00:15:13.100 00:15:20.240 Abhijith Thakur: typically through Oracle SQL, ETL tools, and internal job schedulers, and then

108 00:15:20.910 00:15:23.150 Abhijith Thakur: they were delivered into S3.
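
A hedged sketch of what such a scheduled snapshot export could look like; the connection details, table, and output path are assumptions, and the python-oracledb driver stands in for whatever Oracle tooling the DBA team actually used:

```python
# Illustrative weekly snapshot export: Oracle table -> CSV on a staging path.
import csv
import oracledb  # python-oracledb driver

with oracledb.connect(user="etl_reader", password="...", dsn="prod-db/ORCLPDB") as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT * FROM billing.subscriber_snapshot")
        with open("/data/exports/subscribers.csv", "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow([col[0] for col in cur.description])  # header row
            writer.writerows(cur)  # stream rows straight from the cursor
```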

109 00:15:23.150 00:15:25.541 Awaish Kumar: Yeah. Why are you using

110 00:15:26.950 00:15:35.277 Awaish Kumar: CSV... like, why are you using a CSV export from a database and then loading it into

111 00:15:36.460 00:15:45.980 Awaish Kumar: S3? Like, why not find some tool which can directly send data from Oracle to S3?

112 00:15:46.940 00:15:48.130 Abhijith Thakur: So, I mean,

113 00:15:48.260 00:15:57.224 Abhijith Thakur: it was based on the requirements of the project which I was working upon. So,

114 00:15:58.731 00:16:12.860 Abhijith Thakur: like, there were tools like AWS Glue and AWS DMS, which can be used directly to connect the database to S3 buckets, without the CSV batch exports,

115 00:16:13.475 00:16:20.409 Abhijith Thakur: built in. But in the project, basically, there were regulatory and compliance auditability requirements. So basically,

116 00:16:21.827 00:16:36.239 Abhijith Thakur: the CSV exports allow a point-in-time snapshot, complete with a metadata file, snapshot versioning, and checksums, and these static files can be archived.

117 00:16:37.662 00:16:41.527 Abhijith Thakur: Yeah. So the data owners actually

118 00:16:42.877 00:16:57.230 Abhijith Thakur: belong to a different team, who were actually providing the data. Also, direct access to the production database was restricted in that environment, and that is why that model was being followed.

119 00:16:58.750 00:17:05.119 Awaish Kumar: But you mentioned that you were using the... what was it?

120 00:17:06.829 00:17:08.020 Awaish Kumar: Databricks.

121 00:17:08.250 00:17:09.570 Abhijith Thakur: Databricks? Yes.

122 00:17:10.859 00:17:18.439 Awaish Kumar: So then, first: is there any feature in Databricks to move that data to S3?

123 00:17:19.259 00:17:20.099 Awaish Kumar: But it...

124 00:17:21.949 00:17:26.479 Abhijith Thakur: A feature in Databricks to use... I mean,

125 00:17:28.429 00:17:31.559 Abhijith Thakur: I spoke... like, I can speak about the feature.

126 00:17:32.629 00:17:33.469 Abhijith Thakur: Oh, good.

127 00:17:33.906 00:17:37.400 Awaish Kumar: Databricks is a complete platform, right,

128 00:17:37.740 00:17:40.489 Awaish Kumar: for your data, for building your data pipelines.

129 00:17:40.850 00:17:45.130 Awaish Kumar: It can do ETL and all, like the full pipeline.

130 00:17:45.470 00:17:46.490 Awaish Kumar: So,

131 00:17:46.880 00:17:55.800 Awaish Kumar: but, like, when you were working with Databricks... so you mentioned you are using Databricks, and then using PySpark on top of it

132 00:17:56.070 00:17:57.899 Awaish Kumar: to basically process your data.

133 00:17:58.350 00:18:04.240 Awaish Kumar: So why was the ingestion part not part of Databricks?

134 00:18:05.600 00:18:06.100 Abhijith Thakur: I mean.

135 00:18:06.100 00:18:08.110 Awaish Kumar: Or was. It was a part of books.

136 00:18:08.110 00:18:08.950 Abhijith Thakur: I’m sorry.

137 00:18:10.530 00:18:14.899 Awaish Kumar: How was that ingestion part automated? Right? You are...

138 00:18:15.900 00:18:20.940 Awaish Kumar: It's external: you mentioned you take an export from the Oracle database,

139 00:18:21.090 00:18:27.859 Awaish Kumar: then it gets loaded to AWS S3, and then that file

140 00:18:28.030 00:18:34.190 Awaish Kumar: gets archived, or whatever. So how was that process automated?

141 00:18:36.652 00:18:39.340 Abhijith Thakur: Usually, I mean, we...

142 00:18:40.140 00:19:00.652 Abhijith Thakur: In that process, I was talking about multiple concerns, like security; security was actually one of the major concerns, and the data was coming from different teams, so we didn't own the data, and we did not have access

143 00:19:01.440 00:19:03.740 Abhijith Thakur: to the database. That is why we...

144 00:19:04.160 00:19:06.269 Awaish Kumar: Who was loading the data to S3?

145 00:19:07.070 00:19:09.650 Abhijith Thakur: You said, who was loading the data to S3?

146 00:19:11.140 00:19:15.100 Abhijith Thakur: It was me; I mean, me and my team. We were loading the data into the...

147 00:19:15.100 00:19:19.799 Awaish Kumar: That's my question, right? So, for example, data is in Oracle.

148 00:19:21.290 00:19:24.959 Awaish Kumar: So somebody exported the data from the Oracle database,

149 00:19:25.770 00:19:32.300 Awaish Kumar: and then created a CSV. So now, that CSV, when you are going to load it... so that CSV

150 00:19:32.920 00:19:35.610 Awaish Kumar: needs to be somewhere where you can access it,

151 00:19:35.840 00:19:43.670 Awaish Kumar: download it, load it to S3, archive it; and it needs to be automated, like, you can't do it, right,

152 00:19:43.910 00:19:52.789 Awaish Kumar: manually, like, get a file, download it somewhere, and whatever. It has to be a data pipeline, basically, which

153 00:19:53.010 00:20:00.590 Awaish Kumar: receives that export and then sends it to S3. So what were you doing? Like, what tools were you using, or what

154 00:20:01.720 00:20:04.809 Awaish Kumar: code were you writing to handle that? That's my question.

155 00:20:05.950 00:20:34.100 Abhijith Thakur: Okay, I mean, the pipeline was implemented as a batch ingestion pipeline, which was automated using scheduled shell or Python scripts, with Oracle SQL exports; and things like the AWS CLI and these libraries were used for moving the data from on-prem to S3. So basically, over there, at first we...
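
A minimal sketch of the upload half of such a batch pipeline, assuming boto3 (the standard AWS SDK for Python) and invented directories and bucket names:

```python
# Hedged sketch: push exported CSVs from an on-prem staging directory to S3.
from pathlib import Path

import boto3

s3 = boto3.client("s3")

def upload_exports(local_dir: str, bucket: str, prefix: str) -> None:
    for csv_file in sorted(Path(local_dir).glob("*.csv")):
        s3.upload_file(str(csv_file), bucket, f"{prefix}/{csv_file.name}")
        csv_file.unlink()  # drop the local copy once it is safely in S3

upload_exports("/data/exports", "sales-raw-bucket", "landing")
```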

156 00:20:34.100 00:20:37.900 Awaish Kumar: Quite a lot of things, like shell, Python scripts, AWS CLI...

157 00:20:39.480 00:20:42.899 Awaish Kumar: You weren't using all of them; you would be using some one of them.

158 00:20:46.760 00:20:47.890 Abhijith Thakur: I didn’t get you sorry.

159 00:20:48.930 00:21:03.420 Awaish Kumar: My question is what exactly you were using. For example, if I write a pipeline, I can tell you: this is what I did, this is the tool I used, this is the script I wrote, and then I scheduled it this way. So what is your way?

160 00:21:05.150 00:21:06.839 Abhijith Thakur: Okay. So

161 00:21:07.780 00:21:29.690 Abhijith Thakur: I'll start step by step, because, see, I was actually talking about the exposure which I have from working across multiple projects. So basically, I built a batch-oriented data pipeline that extracts the data from Oracle and then stores it into AWS S3, processes the data using Spark,

162 00:21:30.040 00:21:35.080 Abhijith Thakur: and surfaces insights using Tableau. So basically, at first...

163 00:21:35.080 00:21:37.500 Awaish Kumar: My question is: how do you extract that?

164 00:21:37.940 00:21:38.969 Awaish Kumar: It’s like

165 00:21:40.320 00:21:47.250 Awaish Kumar: What I'm getting from you is that you have data in S3.

166 00:21:47.450 00:21:54.900 Awaish Kumar: Then you read it through PySpark, and then you process it. But there must be someone who is loading data to S3.

167 00:21:56.150 00:21:57.050 Awaish Kumar: Alright.

168 00:21:58.370 00:22:00.909 Awaish Kumar: Who is that? Like, my point is,

169 00:22:01.570 00:22:11.780 Awaish Kumar: I understand your point, where you are saying that you are using Databricks to connect to S3, get the data, process it, and store it back to S3. That's okay.

170 00:22:12.070 00:22:16.029 Awaish Kumar: But that is the transformation and loading part, and that part you are doing.

171 00:22:16.330 00:22:17.790 Awaish Kumar: I got you.

172 00:22:18.000 00:22:22.109 Awaish Kumar: But are you also handling the extraction part, or are you not handling that?

173 00:22:24.290 00:22:24.710 Abhijith Thakur: Let’s.

174 00:22:24.710 00:22:30.159 Awaish Kumar: Someone is there to load data to S3, either you or someone else. That's my question.

175 00:22:30.460 00:22:45.219 Abhijith Thakur: I was not handling the extraction part; it was being managed by a different team. That is why I said: because of security reasons, it was actually managed by a separate team, and we did not have access to those databases. That is why, like...

176 00:22:45.220 00:22:45.890 Awaish Kumar: No, I like.

177 00:22:45.890 00:22:46.990 Abhijith Thakur: I’m using.

178 00:22:46.990 00:22:48.260 Awaish Kumar: I'll give you an example.

179 00:22:49.010 00:22:58.960 Awaish Kumar: I was working with a team where I didn't have access to the database. So what was happening there? They create an export, and then they load it to an

180 00:22:59.340 00:23:01.240 Awaish Kumar: FTP server.

181 00:23:01.750 00:23:07.710 Awaish Kumar: Right? They create an export, and they load it at some place, which is an

182 00:23:07.910 00:23:09.260 Awaish Kumar: FTP server.

183 00:23:09.450 00:23:13.489 Awaish Kumar: So now, my task is to get that data from there,

184 00:23:14.200 00:23:18.510 Awaish Kumar: yeah, that CSV file, from the FTP server, and load it to

185 00:23:19.010 00:23:24.570 Awaish Kumar: S3. So what I'm doing is, I'm using, for example, a Python script

186 00:23:24.670 00:23:31.729 Awaish Kumar: which basically downloads the file from the FTP server, loads it to S3, and then deletes it from

187 00:23:32.441 00:23:40.570 Awaish Kumar: FTP. And that pipeline... like, that script was written in Python and scheduled through Airflow.

188 00:23:40.700 00:23:54.670 Awaish Kumar: Right? So my Airflow is basically triggering the Python script, which downloads a file from the FTP server and loads it to S3. So that was my ingestion part. And then, when data is in S3, I'm writing another pipeline in Databricks

189 00:23:55.450 00:24:05.939 Awaish Kumar: just to transform and load. So that's how I would say it, right? I'm not getting that clearly from you, like, how it ended up in S3.
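
A rough sketch of the FTP-to-S3 ingestion the interviewer describes: a Python callable scheduled by Airflow. Host, credentials, file names, and bucket are illustrative assumptions (the schedule="@weekly" parameter assumes Airflow 2.4+):

```python
# Hedged sketch: Airflow DAG that pulls a CSV from an FTP server, uploads it
# to S3, then deletes the remote copy. All endpoints are assumptions.
from datetime import datetime
from ftplib import FTP

import boto3
from airflow import DAG
from airflow.operators.python import PythonOperator

def ftp_to_s3() -> None:
    ftp = FTP("ftp.example.com")
    ftp.login("etl_user", "etl_password")
    with open("/tmp/export.csv", "wb") as f:
        ftp.retrbinary("RETR export.csv", f.write)  # download from FTP
    boto3.client("s3").upload_file("/tmp/export.csv", "sales-raw-bucket", "landing/export.csv")
    ftp.delete("export.csv")  # clean up the remote copy
    ftp.quit()

with DAG("ftp_to_s3_ingest", start_date=datetime(2025, 1, 1), schedule="@weekly", catchup=False) as dag:
    PythonOperator(task_id="ftp_to_s3", python_callable=ftp_to_s3)
```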

190 00:24:06.910 00:24:08.880 Abhijith Thakur: So, as I said,

191 00:24:09.289 00:24:18.219 Abhijith Thakur: the data was being exported by some other team, as we did not have the access. They were exporting the data into the CSV files, where I was...

192 00:24:18.220 00:24:23.659 Awaish Kumar: Were they also loading it to S3, or were you the one loading it first?

193 00:24:23.660 00:24:42.369 Abhijith Thakur: Like, I was connected to the FTP server. So, with the FTP library, I was actually pulling the files from the FTP server, and once the file was downloaded locally over FTP, it was being uploaded via these libraries.

194 00:24:42.370 00:24:48.280 Awaish Kumar: So yeah, we can move forward from here, like, now, to

195 00:24:49.810 00:24:52.090 Awaish Kumar: the data processing part of it.

196 00:24:52.750 00:25:05.490 Awaish Kumar: So, in the data processing, for example... like, what...

197 00:25:06.060 00:25:13.910 Awaish Kumar: Basically, you get a file, you pre-process it, and you load it somewhere again in S3. But then, how...

198 00:25:14.690 00:25:25.560 Awaish Kumar: That pre-processing is just cleaning and aggregation and things like that, and that's okay, I'm not concerned about that. But then you have to do some kind of data modeling,

199 00:25:25.900 00:25:29.340 Awaish Kumar: so that end users can use it for analytical purposes.

200 00:25:30.380 00:25:36.640 Awaish Kumar: And were you responsible for that modeling work, or someone else?

201 00:25:38.560 00:25:40.400 Abhijith Thakur: The data modeling part?

202 00:25:40.730 00:25:43.930 Abhijith Thakur: Yes, I mean, the

203 00:25:46.130 00:26:02.900 Abhijith Thakur: I mean, I was responsible for the data modeling part from end to end, like processing and building the data models, and then preparing the datasets for the downstream analytics, for analytics dashboards and the...

204 00:26:03.800 00:26:09.170 Awaish Kumar: So what is the difference between star schema and snowflake schema?

205 00:26:10.190 00:26:11.220 Abhijith Thakur: So

206 00:26:11.500 00:26:20.236 Abhijith Thakur: basically, talking about star and snowflake... I mean, both these types of schemas would be like...

207 00:26:21.340 00:26:23.649 Abhijith Thakur: Like, the star schema has

208 00:26:24.160 00:26:39.510 Abhijith Thakur: dimensions which are, simply put, flat tables joined directly to the fact table; whereas the snowflake schema has normalized dimensions, where the dimensions are broken into related sub-dimension tables.

209 00:26:40.550 00:26:46.070 Abhijith Thakur: So the pros and cons of so I mean.

210 00:26:46.070 00:26:52.589 Awaish Kumar: If you can define some pros and cons of both, and maybe give an example of, like,

211 00:26:53.350 00:26:55.699 Awaish Kumar: use cases: like, when to use

212 00:26:55.840 00:26:58.370 Awaish Kumar: snowflake schema, and when to use star schema.

213 00:26:58.900 00:27:01.463 Abhijith Thakur: When to use snowflake, when to use...

214 00:27:02.610 00:27:04.300 Abhijith Thakur: Yes, so

215 00:27:05.000 00:27:32.759 Abhijith Thakur: For the star schema: it has faster query performance, and basically it is simpler to understand, like a flat, intuitive structure for analysts and dashboard users. And then it is optimized for OLAP-type transactions, which is ideal for slicing and aggregation in reporting use cases.

216 00:27:32.790 00:27:45.545 Abhijith Thakur: It is good for ad hoc queries. And, like, the cons for the star schema would be: there is data redundancy, and, like, some

217 00:27:46.110 00:27:49.980 Abhijith Thakur: dimensional info may be repeated across many rows.

218 00:27:55.150 00:27:56.060 Abhijith Thakur: and then.

219 00:27:56.060 00:27:56.680 Awaish Kumar: Okay.

220 00:27:59.280 00:28:00.860 Abhijith Thakur: Basically like

221 00:28:01.020 00:28:25.379 Abhijith Thakur: For the snowflake schema, if I talk about the pros: it reduces data redundancy and improves data integrity, and it is more scalable and has smaller storage. And then the disadvantages of the snowflake schema would be: it has lower query performance and

222 00:28:25.867 00:28:36.360 Abhijith Thakur: more joins needed. So it really requires longer queries, and it has complex joins; it's less intuitive for end users.
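
To make the contrast concrete, here is illustrative DDL, run through Spark SQL, with invented table and column names: the star variant keeps one flat customer dimension, while the snowflake variant normalizes it into sub-dimension tables.

```python
# Hedged sketch: star vs. snowflake layout for the same customer dimension.
# Star: one denormalized dimension table, joined directly to the fact table.
spark.sql("""
    CREATE TABLE dim_customer_star (
        customer_id BIGINT, name STRING, city STRING, state STRING, country STRING
    )
""")

# Snowflake: the dimension normalized into related sub-dimension tables,
# so queries need extra joins (dim_customer -> dim_city -> dim_state).
spark.sql("CREATE TABLE dim_state (state_id BIGINT, state STRING, country STRING)")
spark.sql("CREATE TABLE dim_city (city_id BIGINT, city STRING, state_id BIGINT)")
spark.sql("CREATE TABLE dim_customer (customer_id BIGINT, name STRING, city_id BIGINT)")
```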

223 00:28:40.840 00:28:50.920 Awaish Kumar: So, like, about the snowflake schema and the star schema: you said star is faster in query performance. So,

224 00:28:51.290 00:28:54.969 Awaish Kumar: there's also, like, a flat table schema;

225 00:28:55.310 00:29:02.109 Awaish Kumar: that's going to give even faster query performance. Why not use a flat table schema?

226 00:29:02.400 00:29:05.209 Awaish Kumar: And why do we use star schema?

227 00:29:06.260 00:29:07.499 Abhijith Thakur: Why do we use?

228 00:29:07.900 00:29:09.059 Abhijith Thakur: I mean so.

229 00:29:09.060 00:29:12.560 Awaish Kumar: If I need faster query performance, I could just use a

230 00:29:13.270 00:29:17.140 Awaish Kumar: flat table schema as well. So why not?

231 00:29:19.540 00:29:24.799 Abhijith Thakur: I mean, in general, if we talk about these cases,

232 00:29:25.650 00:29:34.519 Abhijith Thakur: compared to the star schema... it is not ideal for the long term,

233 00:29:35.310 00:29:39.230 Abhijith Thakur: the flat table schema, because,

234 00:29:40.070 00:29:44.790 Abhijith Thakur: due to its data redundancy and maintenance issues,

235 00:29:45.040 00:29:56.120 Abhijith Thakur: the storage, and also poor data governance, I think the flat table schema is not ideal.

236 00:29:57.630 00:30:07.040 Awaish Kumar: Okay. And then, so what are, like, slowly changing dimensions?

237 00:30:07.760 00:30:09.479 Abhijith Thakur: Slowly changing dimensions.

238 00:30:10.210 00:30:11.020 Awaish Kumar: Yeah.

239 00:30:14.230 00:30:19.695 Abhijith Thakur: Like slowly changing dimensions basically are

240 00:30:20.600 00:30:41.524 Abhijith Thakur: the dimension tables where the attribute values change slowly over time, and then we need to decide how to track and store these changes. So basically, there are different types of slowly changing dimensions,

241 00:30:42.817 00:30:50.430 Abhijith Thakur: like type 0, type 1, type 2, type 3, and up to type 4 and 6.

242 00:30:53.675 00:30:54.310 Abhijith Thakur: okay.

243 00:30:54.740 00:30:57.410 Awaish Kumar: What are, like... what is type 0, 1, 2?

244 00:30:58.700 00:31:00.050 Abhijith Thakur: What is the difference?

245 00:31:01.030 00:31:04.339 Awaish Kumar: These 3 types, you know: what's type 1, type 2?

246 00:31:04.340 00:31:14.761 Abhijith Thakur: Yeah. So type 0 is where no changes are allowed, where we keep the original data as it is. So we keep the original,

247 00:31:15.820 00:31:36.169 Abhijith Thakur: let's say, a customer's birth date, even if the record was entered incorrectly. Whereas type 1 is where we overwrite the old data and no history is maintained; so if the customer changes their address, it updates the address directly,

248 00:31:36.170 00:31:51.039 Abhijith Thakur: and then type 2 would be: there will be a new row created for every change. It keeps the full history, so a new row is added with the new address, and the old one

249 00:31:51.330 00:31:52.760 Abhijith Thakur: would be end-dated.
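
A hedged sketch of that Type 2 behaviour as a Delta Lake MERGE, assuming invented table names, columns, and an is_current/end_date convention: the matched current row is end-dated, and the new version is inserted as a fresh row.

```python
# Hedged sketch: SCD Type 2 upsert on a Delta dimension table.
# Step 1: close out current rows whose address changed.
spark.sql("""
    MERGE INTO dim_customer AS t
    USING address_updates AS s
      ON t.customer_id = s.customer_id AND t.is_current = true
    WHEN MATCHED AND t.address <> s.address THEN UPDATE SET
      t.is_current = false,
      t.end_date   = current_date()
""")

# Step 2: insert the new versions as current rows, keeping the full history.
# (For brevity this inserts a row for every incoming update, not only changes.)
spark.sql("""
    INSERT INTO dim_customer
    SELECT customer_id, address, current_date() AS start_date,
           CAST(NULL AS DATE) AS end_date, true AS is_current
    FROM address_updates
""")
```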

250 00:31:55.310 00:32:04.080 Awaish Kumar: Okay. And apart from that, have you ever used Airflow?

251 00:32:05.500 00:32:07.990 Abhijith Thakur: Have I ever used Airflow, Apache Airflow?

252 00:32:11.020 00:32:11.840 Awaish Kumar: Yeah.

253 00:32:12.090 00:32:18.961 Abhijith Thakur: Yes, I mean, basically, I have used it as an

254 00:32:20.020 00:32:30.120 Abhijith Thakur: orchestration tool, where I scheduled, orchestrated, and monitored the data pipelines, particularly for...

255 00:32:30.120 00:32:32.630 Awaish Kumar: What is the architecture of Airflow?

256 00:32:34.120 00:32:36.270 Abhijith Thakur: The architecture of Airflow?

257 00:32:36.940 00:32:41.898 Abhijith Thakur: So if I speak about the architecture, then

258 00:32:44.070 00:33:11.960 Abhijith Thakur: it has a modular, distributed architecture. It has four main components: we have a scheduler, which decides what to run; then it has an executor, which runs the tasks; then it has a web server, which serves the UI; and then we have a metadata database, where the DAG run states and the task history are stored.

259 00:33:14.120 00:33:14.830 Awaish Kumar: Okay?

260 00:33:15.760 00:33:24.208 Awaish Kumar: And what if, for example, I have a task running in Airflow, and it gets stuck...

261 00:33:27.170 00:33:29.879 Abhijith Thakur: What is the best way to handle that?

262 00:33:30.260 00:33:35.242 Awaish Kumar: A task which sometimes runs okay,

263 00:33:36.400 00:33:45.250 Awaish Kumar: successfully, and then the rest of the pipeline runs fine. But sometimes it just gets stuck, and all my pipeline gets stuck as well.

264 00:33:46.160 00:33:48.570 Awaish Kumar: So what is the best way to handle that?

265 00:33:51.320 00:33:56.269 Abhijith Thakur: I would say, like, the best way to handle this

266 00:33:56.470 00:34:06.520 Abhijith Thakur: and manage it would be: we can mark the task as failed

267 00:34:07.125 00:34:17.529 Abhijith Thakur: manually, from the UI or the CLI, and then we can enable timeouts, retries, and alerts in our...

268 00:34:18.840 00:34:20.100 Abhijith Thakur: and then.

269 00:34:20.100 00:34:20.670 Awaish Kumar: Can I.

270 00:34:21.030 00:34:21.870 Abhijith Thakur: We can use it.

271 00:34:21.870 00:34:22.959 Awaish Kumar: Second time, on.

272 00:34:23.489 00:34:24.269 Abhijith Thakur: Sorry.

273 00:34:24.719 00:34:26.419 Awaish Kumar: How can I set the timeout?

274 00:34:26.820 00:34:30.230 Abhijith Thakur: How can I? I mean

275 00:34:32.737 00:34:36.430 Abhijith Thakur: timeouts... in order to set the timeouts,

276 00:34:40.050 00:34:46.420 Abhijith Thakur: I can say, in this platform, in Apache Airflow,

277 00:34:48.389 00:35:07.150 Abhijith Thakur: I think there is an option where we can basically specify the execution timeout. And for handling, like, there's a dagrun_timeout, which basically provides a maximum duration for the entire DAG run.

278 00:35:08.930 00:35:11.029 Awaish Kumar: I’m asking about a single task.

279 00:35:11.300 00:35:18.550 Awaish Kumar: I have a pipeline which has 20 tasks. I don't want to expire everything; I just want that, if there's a task which gets stuck

280 00:35:18.800 00:35:24.439 Awaish Kumar: for more than 60 seconds, it just gets skipped and everything else runs.

281 00:35:27.493 00:35:30.879 Abhijith Thakur: For a specific task, I would say,

282 00:35:31.696 00:35:34.400 Abhijith Thakur: if I want to set it up,

283 00:35:34.630 00:35:42.210 Abhijith Thakur: then, like, for a specific task, I will be taking the help of,

284 00:35:43.166 00:36:00.990 Abhijith Thakur: like, execution_timeout, which is set equal to a timedelta, and then I can specify seconds equal to 60 to kill the task. And then, if it runs for too long, basically we can have a failure callback to mark the task, and then...

285 00:36:00.990 00:36:03.810 Awaish Kumar: Where can you write this "seconds equal to 60"?

286 00:36:04.500 00:36:05.390 Abhijith Thakur: Sorry?

287 00:36:05.760 00:36:08.340 Awaish Kumar: Where will you write this "seconds equal to 60"?

288 00:36:08.340 00:36:13.810 Abhijith Thakur: Where will I write this "seconds equal to 60"? In...

289 00:36:15.165 00:36:23.179 Abhijith Thakur: like, this needs to be written inside the configuration... so, usually, we, like,

290 00:36:23.570 00:36:32.180 Abhijith Thakur: we write it inside the task definition, or as a parameter of the operator; let's suppose the PythonOperator or a

291 00:36:33.000 00:36:34.580 Abhijith Thakur: like, a BashOperator.

292 00:36:35.850 00:36:40.920 Awaish Kumar: But you are mentioning both things: you're saying we can write it in configuration, and we can write it in the

293 00:36:41.860 00:36:45.360 Awaish Kumar: task. Like, what does that mean?

294 00:36:45.680 00:36:55.599 Abhijith Thakur: No, I mean, we will basically be writing it inside the task definition, mainly in the task definition.

295 00:36:58.020 00:37:01.350 Awaish Kumar: How can I write it? That's my question. What is the name?

296 00:37:04.380 00:37:08.879 Awaish Kumar: If I write it in the operator, then, like, how should I write, in the operator,

297 00:37:09.200 00:37:13.589 Awaish Kumar: this, you know, like, the expiry time?

298 00:37:15.649 00:37:26.049 Abhijith Thakur: I could say the execution_timeout is, like, an optional parameter in Airflow, which...

299 00:37:26.050 00:37:27.189 Awaish Kumar: Is that a parameter?

300 00:37:27.190 00:37:30.680 Abhijith Thakur: which is usually used to limit the maximum...

301 00:37:30.680 00:37:32.709 Awaish Kumar: execution_timeout is a parameter?

302 00:37:34.330 00:37:38.500 Abhijith Thakur: I mean, I can write it inside the task definition. So it is an...

303 00:37:38.500 00:37:41.629 Awaish Kumar: But that's my question: like, when you define a task,

304 00:37:42.000 00:37:45.370 Awaish Kumar: is execution_timeout a valid parameter?

305 00:37:46.570 00:37:47.910 Abhijith Thakur: Is execution...

306 00:37:48.350 00:37:52.990 Abhijith Thakur: I mean, that's an optional parameter, yeah.

307 00:37:53.810 00:37:55.350 Awaish Kumar: Is this even a parameter?

308 00:37:56.320 00:37:59.399 Abhijith Thakur: We can set it up for this condition.

309 00:38:00.470 00:38:01.270 Abhijith Thakur: So.

310 00:38:01.270 00:38:04.839 Awaish Kumar: It's not a valid parameter; it's not valid, like, you can't write execution_timeout.

311 00:38:05.710 00:38:09.489 Awaish Kumar: There's no way; Airflow doesn't understand that parameter.

312 00:38:10.500 00:38:15.329 Abhijith Thakur: I mean, it is optional; I don't think it is mandatory.

313 00:38:15.330 00:38:21.589 Awaish Kumar: The whole thing is, like, Airflow has given an option to write a

314 00:38:21.850 00:38:29.840 Awaish Kumar: task via an operator; inside the operator there are some parameters, and they are defined already: what you can write and what you can't write.

315 00:38:30.020 00:38:35.330 Awaish Kumar: You can't just write anything. And it's not only there.

316 00:38:37.760 00:38:47.750 Awaish Kumar: Okay, I'm almost done with all my technical questions. So now, if you would like to

317 00:38:48.980 00:38:51.540 Awaish Kumar: ask anything, yeah, we can.

318 00:38:52.882 00:38:54.419 Abhijith Thakur: I mean, I

319 00:38:55.230 00:39:14.189 Abhijith Thakur: would like to answer the previous question: like, execution_timeout is actually the keyword which you will be using in task definitions, where we have execution_timeout, and it is officially supported.

320 00:39:14.190 00:39:20.519 Awaish Kumar: That’s not a parameter, either. So let’s just move ahead.
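
For reference on the exchange above: execution_timeout is in fact a documented BaseOperator argument in Apache Airflow, set per task as a datetime.timedelta. A minimal sketch, with an invented task and command:

```python
# execution_timeout is a standard per-task BaseOperator argument in Airflow:
# if the task runs longer than the timedelta, Airflow fails it.
from datetime import timedelta

from airflow.operators.bash import BashOperator

slow_task = BashOperator(
    task_id="slow_task",
    bash_command="sleep 300",                 # illustrative long-running command
    execution_timeout=timedelta(seconds=60),  # fail the task after 60 seconds
    retries=1,                                # optionally retry after the timeout
)
```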

321 00:39:21.060 00:39:22.650 Awaish Kumar: I just want to.

322 00:39:23.340 00:39:27.299 Awaish Kumar: Yeah, like, I'm done with all my technical questions. I just want to know...

323 00:39:27.440 00:39:34.630 Awaish Kumar: I have, like, some time, if you want to ask something about Brainforce, the company, and the job role and such.

324 00:39:35.663 00:39:43.890 Abhijith Thakur: Yes, I would definitely like to know about the role and the responsibilities.

325 00:39:45.020 00:39:47.990 Awaish Kumar: Did you apply for the AE, analytics engineer, role?

326 00:39:50.320 00:39:50.940 Abhijith Thakur: I’m sorry.

327 00:39:50.940 00:39:53.450 Awaish Kumar: Did you apply for analytics engineer, or...

328 00:39:55.610 00:40:00.680 Abhijith Thakur: I did not apply. Actually, I was being referred by Uttam.

329 00:40:03.070 00:40:12.899 Awaish Kumar: Okay, okay. Actually, in our emails, like, there was no mention of the job role. I...

330 00:40:12.900 00:40:13.580 Abhijith Thakur: Yeah.

331 00:40:13.840 00:40:18.220 Awaish Kumar: got confused. But yeah, that's okay. So,

332 00:40:18.400 00:40:25.350 Awaish Kumar: like, we have a data engineer / analytics engineer position, where, basically... alright,

333 00:40:26.282 00:40:45.280 Awaish Kumar: you are assigned client work, and then we do engineering work. We are using Dagster, for example, for orchestrating our data pipelines. And then we have dbt; we use dbt for all of our modeling work. And then, basically, it depends from client to client which data warehouse we are going to use.

334 00:40:45.400 00:40:49.899 Awaish Kumar: It can be Snowflake, typically, or Redshift, like...

335 00:40:51.590 00:41:00.580 Awaish Kumar: And then we... yeah, that's mainly the kind of stack: we use Python for our Dagster pipelines, and use SQL, basically, as part of...

336 00:41:19.620 00:41:22.250 Awaish Kumar: I just got disconnected. But yeah, the

337 00:41:22.610 00:41:25.218 Awaish Kumar: yeah, I was saying that

338 00:41:29.040 00:41:35.500 Awaish Kumar: I was just saying that, yeah, as data engineers, we have, like, we have

339 00:41:35.770 00:41:42.790 Awaish Kumar: quite a lot of clients which basically need DE or AE help.

340 00:41:42.900 00:42:06.819 Awaish Kumar: So we are basically using Python, mainly, as our programming language for writing all of our data pipelines, and then we use SQL, as part of dbt, to do all the dbt modeling work. And we use quite a lot of ingestion tools, basically, for example, Fivetran, Polytomic, Portable, and there are different other

341 00:42:07.010 00:42:13.385 Awaish Kumar: CDP tools as well, like Segment. So these kinds of tools are being used to

342 00:42:14.760 00:42:17.370 Awaish Kumar: ingest the data. And

343 00:42:20.210 00:42:26.519 Awaish Kumar: then we do all our dbt modeling work, and then it lands into some warehouse, which is

344 00:42:26.670 00:42:30.424 Awaish Kumar: Snowflake; sometimes it's BigQuery. And

345 00:42:31.460 00:42:55.210 Awaish Kumar: that's mainly the role of a DE here. And basically, the person is going to be assigned to one or two or three different clients, and, like, basically, we know how much effort we want to put in each week for each client. And then, that way, we are going to handle

346 00:42:56.060 00:43:02.255 Awaish Kumar: that. And we are using, for example, Linear as our project management tool. And

347 00:43:02.730 00:43:16.159 Awaish Kumar: yeah, that's all. And about Brainforce, you already know that, like, it's a consulting firm which basically provides data and AI consultancy services.

348 00:43:16.270 00:43:22.719 Awaish Kumar: And yeah, that's pretty much it.

349 00:43:24.190 00:43:25.015 Abhijith Thakur: Okay,

350 00:43:27.660 00:43:34.210 Abhijith Thakur: I mean, talking about the interview, like, where do you think I can improve?

351 00:43:43.320 00:43:43.710 Awaish Kumar: Sorry.

352 00:43:44.760 00:43:45.540 Abhijith Thakur: Hello!

353 00:43:47.160 00:43:47.930 Awaish Kumar: Hello!

354 00:43:48.530 00:43:59.650 Abhijith Thakur: Yes, so I was saying that, like, talking about this interview, can you, like, help me understand on which areas I could improve myself?

355 00:44:10.255 00:44:14.099 Awaish Kumar: I couldn’t hear you properly. Yeah.

356 00:44:14.100 00:44:21.039 Abhijith Thakur: I mean, I don't know... I can see a network issue from your side, where, you know, your network has been fluctuating,

357 00:44:22.270 00:44:22.930 Abhijith Thakur: right.

358 00:44:26.170 00:44:28.669 Awaish Kumar: Is, am I? Am I audible?

359 00:44:31.070 00:44:32.850 Abhijith Thakur: No, I can’t hear you properly.

360 00:45:29.820 00:45:35.169 Awaish Kumar: Yeah. Am I audible now? You can, like, turn off your camera; maybe that would be better.

361 00:45:36.190 00:45:38.019 Abhijith Thakur: Yeah, yes, I can hear you. Sorry.

362 00:45:38.020 00:45:40.440 Awaish Kumar: Okay, no.

363 00:45:41.160 00:45:43.029 Awaish Kumar: So yeah, were you asking something?

364 00:45:43.070 00:45:49.640 Abhijith Thakur: Like I was saying: talking about this interview, where do you think I could improve, on my side?

365 00:45:52.546 00:45:53.103 Awaish Kumar: Like

366 00:45:54.710 00:46:06.020 Awaish Kumar: Obviously, anyone can improve on anything, but that's okay. Like, Rico from our operations team is going to get back to you

367 00:46:06.742 00:46:09.707 Awaish Kumar: with our feedback and

368 00:46:10.866 00:46:18.500 Awaish Kumar: the next steps. Yeah, and I think... it's Friday, so yeah, it might be next week.

369 00:46:19.280 00:46:21.529 Abhijith Thakur: Okay. Okay. Sounds, good. Yeah.

370 00:46:21.800 00:46:23.500 Awaish Kumar: Okay. Thank you.

371 00:46:23.650 00:46:25.209 Abhijith Thakur: And yeah, thank you.