Meeting Title: Data Engineer Interview (Abhijith Thakur) Date: 2025-07-29 Meeting participants: Abhijith Thakur, Awaish Kumar


WEBVTT

1 00:00:53.490 00:00:54.400 Abhijith Thakur: Hello!

2 00:00:55.050 00:00:55.810 Awaish Kumar: Hello!

3 00:00:56.350 00:00:58.060 Abhijith Thakur: Hi, hi, Awaish.

4 00:00:58.990 00:00:59.660 Awaish Kumar: Right.

5 00:00:59.790 00:01:00.870 Abhijith Thakur: Good morning!

6 00:01:02.283 00:01:03.899 Awaish Kumar: Good morning. How are you?

7 00:01:03.900 00:01:06.699 Abhijith Thakur: I’m doing great. Can you see my video?

8 00:01:07.530 00:01:08.730 Awaish Kumar: Yes, I can.

9 00:01:08.730 00:01:11.879 Abhijith Thakur: Oh, yes, I’m doing great. How about you?

10 00:01:12.760 00:01:16.229 Awaish Kumar: I’m good as well. So where are you located?

11 00:01:16.410 00:01:19.519 Abhijith Thakur: I am currently located in Houston, Texas.

12 00:01:21.220 00:01:25.589 Awaish Kumar: Okay, and so like, it’s the eastern time zone. Right?

13 00:01:26.030 00:01:27.590 Abhijith Thakur: CST. Yes.

14 00:01:28.800 00:01:29.420 Awaish Kumar: Okay?

15 00:01:31.160 00:01:34.989 Awaish Kumar: Yeah, I will share what will be the agenda of this meeting,

16 00:01:35.170 00:01:35.880 Awaish Kumar: And

17 00:01:37.355 00:01:47.114 Awaish Kumar: and introduce myself and about a company. And then, yeah, you can start with your introduction, and then we will deep dive into further, like

18 00:01:47.910 00:02:05.609 Awaish Kumar: technical discussion about different projects you might have worked on. So my name is Awaish Kumar. I'm a data engineering manager here at Brainforge, and I have around 8 to 10 years of experience working, you know, as a kind of full-stack data engineer.

19 00:02:06.560 00:02:08.000 Awaish Kumar: And then the...

20 00:02:09.050 00:02:16.330 Awaish Kumar: at Brainforge, basically, we are a consulting company providing certain data and AI services to

21 00:02:17.011 00:02:19.820 Awaish Kumar: different clients in different industries.

22 00:02:20.840 00:02:22.490 Awaish Kumar: So mainly the

23 00:02:23.780 00:02:34.568 Awaish Kumar: and, like, Brainforge kind of operates remotely. Everybody is working remote from

24 00:02:35.190 00:02:37.960 Awaish Kumar: almost everywhere in the world.

25 00:02:38.820 00:02:43.680 Awaish Kumar: So most of our clients are from the US. So yeah, it would be nice

26 00:02:43.840 00:02:47.649 Awaish Kumar: for people in the US, so they can work on their own.

27 00:02:48.350 00:02:50.549 Awaish Kumar: They can manage their hours.

28 00:02:51.167 00:03:00.180 Awaish Kumar: And yeah, also, Brainforge has different kinds of arrangements for

29 00:03:01.260 00:03:02.850 Awaish Kumar: how they

30 00:03:03.330 00:03:10.190 Awaish Kumar: hire people: they can hire full time, part time, depending on

31 00:03:11.140 00:03:13.820 Awaish Kumar: everyone's needs.

32 00:03:14.260 00:03:20.750 Awaish Kumar: Yeah. So that’s basically what we are doing and how we are operating here.

33 00:03:20.880 00:03:23.219 Abhijith Thakur: Now you can introduce yourself.

34 00:03:24.030 00:03:24.830 Abhijith Thakur: Okay?

35 00:03:25.140 00:03:32.170 Abhijith Thakur: So yes, I did understand about your company and how you work. But talking about me:

36 00:03:32.760 00:03:57.639 Abhijith Thakur: I am a recent graduate. I finished my master's in computer science at Lamar University, and I graduated last December. I have around 3-plus years of hands-on experience, I can say, working with data, everything from building ETL pipelines and data models and developing dashboards

37 00:03:57.770 00:04:01.480 Abhijith Thakur: and also collaborating with different cross-functional teams.

38 00:04:01.790 00:04:23.029 Abhijith Thakur: And I have worked with companies like Cigna Healthcare, which is my most recent client, and I also got a chance to work with TCS and Hexaware in India, where I handled real-world projects involving healthcare, telecom and insurance data domains.

39 00:04:23.030 00:04:33.959 Abhijith Thakur: Most of my work revolves around Python, SQL and AWS, and tools like Power BI and Tableau for data analytics.

40 00:04:34.400 00:04:48.502 Abhijith Thakur: So at Cigna Healthcare, I have been focusing on automating the workflows, improving the predictive models for care cost, while

41 00:04:49.330 00:04:58.250 Abhijith Thakur: solving problems end to end in real time, understanding the business context and handling the messy data,

42 00:04:58.410 00:05:05.030 Abhijith Thakur: delivering something that actually drives decisions. So yes, this is all about me.

43 00:05:09.480 00:05:17.530 Awaish Kumar: Yeah. So you mentioned you are a recent graduate, and then you mentioned you have 3 years of experience working as a data person. So like,

44 00:05:17.830 00:05:24.540 Awaish Kumar: like, did you work full time or part time with your master's? Or was it

45 00:05:24.540 00:05:26.820 Awaish Kumar: experience before your master’s.

46 00:05:27.100 00:05:50.651 Abhijith Thakur: Yes, I did have my experience before my master's. While I was studying my bachelor's in India, I got a chance to work with technologies, and then later on as an intern. And then I switched to TCS, where I was working as a data modeler. And

47 00:05:51.540 00:05:59.410 Abhijith Thakur: after that I came to the US to pursue my master's, and been...

48 00:05:59.410 00:06:01.779 Awaish Kumar: How long did you work at TCS?

49 00:06:02.390 00:06:08.570 Abhijith Thakur: TCS, I can say around 1.6 years, I guess. Yes.

50 00:06:08.570 00:06:12.699 Awaish Kumar: Okay, so what did you do as a data modeler?

51 00:06:13.100 00:06:15.645 Abhijith Thakur: Data modeler. So we had to...

52 00:06:16.120 00:06:44.622 Abhijith Thakur: We were actually working with STC, a client in the telecom domain. We had to take over the IBM project, where we were asked to build enterprise logical data models for all the business applications at STC, which is Saudi Telecom. We were using Erwin as a data modeling tool, and

53 00:06:45.270 00:06:55.099 Abhijith Thakur: data cleaning, metadata, and all this stuff, but more precisely I was into building enterprise logical data models.

54 00:06:55.840 00:06:56.879 Awaish Kumar: Oh, yeah, thank you.

55 00:06:57.732 00:07:02.569 Awaish Kumar: Well, I want to hear more about data modeling, like, what different data modeling

56 00:07:02.680 00:07:07.380 Awaish Kumar: techniques you applied, or like what you did as a data modeler,

57 00:07:09.020 00:07:22.209 Awaish Kumar: like, overall, you might have worked on different things, right? Apart from the reporting side, you are saying that you were more involved in modeling the data itself, right? I want to just talk about that, like

58 00:07:22.450 00:07:29.960 Awaish Kumar: how the data was coming in, how you modeled it, what kind of different modeling techniques you used, things like that.

59 00:07:30.480 00:07:32.500 Abhijith Thakur: Okay, so

60 00:07:32.700 00:07:44.322 Abhijith Thakur: while working with TCS, yes, most of my work was related to data modeling. And we were given some

61 00:07:45.350 00:07:52.540 Abhijith Thakur: work... we were involved in creating... I'm sorry, I'm sorry.

62 00:07:53.220 00:07:55.469 Abhijith Thakur: So where we were given,

63 00:07:59.290 00:08:23.139 Abhijith Thakur: like, we were involved in creating logical and physical data models to support the business reports and the business analytics. The source data was coming from multiple systems: the telecom usage, the CRM systems, the billing databases. And we were dealing with, let's say, millions of records every month. And we've been able to...

64 00:08:23.140 00:08:30.175 Awaish Kumar: Yeah, yeah, sorry to interrupt. But like, I want to hear more. You named a few things

65 00:08:30.620 00:08:34.710 Awaish Kumar: on a business level, like some databases. But I want to see like

66 00:08:35.039 00:08:37.369 Awaish Kumar: I want to hear more about technical terms.

67 00:08:38.370 00:08:39.150 Awaish Kumar: Like

68 00:08:40.030 00:08:48.240 Awaish Kumar: you used, like, a CRM tool. But what CRM tool? Was it internal, or

69 00:08:48.420 00:08:54.970 Awaish Kumar: was it some third-party tool? Or what databases did you use, like Postgres, or

70 00:08:55.490 00:08:59.000 Awaish Kumar: or like MySQL server, what was it?

71 00:08:59.000 00:09:17.539 Abhijith Thakur: So generally, the database which we used was MySQL, and we had to pull... we used to get the metadata from Informatica. And we used to collect the data, do some data cleaning. And then

72 00:09:17.770 00:09:38.817 Abhijith Thakur: the data streams were coming from OSS systems. And the client didn't use any third-party data warehousing tools, so we built most of the pipelines using SQL. And for the core data warehouse, for reporting and layering, we used

73 00:09:39.760 00:09:44.210 Abhijith Thakur: Amazon Redshift. And yeah.

74 00:09:46.620 00:09:55.260 Awaish Kumar: Okay. So mostly, you said the data was coming from MySQL, and then it was going to some

75 00:09:55.700 00:09:56.820 Awaish Kumar: HDFS?

76 00:09:59.700 00:10:02.190 Awaish Kumar: Like, what is HDFS?

77 00:10:03.000 00:10:04.210 Abhijith Thakur: I’m sorry.

78 00:10:04.650 00:10:05.990 Awaish Kumar: What is HDFS?

79 00:10:06.720 00:10:14.140 Abhijith Thakur: Like I mentioned, the data streams were coming from OSS systems.

80 00:10:16.320 00:10:17.090 Awaish Kumar: Okay.

81 00:10:17.350 00:10:21.970 Abhijith Thakur: And we have been using

82 00:10:27.120 00:10:28.850 Awaish Kumar: SQL.

83 00:10:30.030 00:10:33.230 Abhijith Thakur: Like database.

84 00:10:35.800 00:10:44.310 Awaish Kumar: Like. That’s what I’m asking. I am. I’m not sure about this database like what it does, how it operates things like that. I want to know more about it.

85 00:10:44.790 00:10:47.140 Abhijith Thakur: You want to know more about HDFS?

86 00:10:48.660 00:10:53.029 Abhijith Thakur: HDFS is a file system, right? It's how it was being...

87 00:10:53.170 00:10:58.850 Awaish Kumar: Like, were you using Hadoop as a warehouse, or...

88 00:10:59.290 00:11:04.689 Awaish Kumar: but like, the HDFS file system was being used to transfer your data

89 00:11:04.810 00:11:07.860 Awaish Kumar: to somewhere, some other database, like... how?

90 00:11:08.230 00:11:14.500 Abhijith Thakur: I mean, using PySpark on Databricks, building... taking the raw...

91 00:11:14.500 00:11:19.279 Awaish Kumar: PySpark is mostly used for processing. It's not a storage.

92 00:11:19.530 00:11:23.300 Awaish Kumar: So where was the data going to be stored?

93 00:11:26.060 00:11:27.659 Abhijith Thakur: Storage of data.

94 00:11:29.650 00:11:38.099 Awaish Kumar: Like, some data is coming from somewhere. You process it using PySpark. Then finally, it goes somewhere, right? It lands somewhere, in some database.

95 00:11:38.300 00:11:39.110 Abhijith Thakur: Yes.

96 00:11:39.950 00:11:41.620 Awaish Kumar: And what it is.

97 00:11:41.820 00:11:42.420 Abhijith Thakur: Does it?

98 00:11:42.420 00:11:43.300 Awaish Kumar: Destination.

99 00:11:43.300 00:11:45.000 Abhijith Thakur: Redshift. Amazon Redshift.

100 00:11:45.640 00:11:55.360 Awaish Kumar: Okay. So now you mentioned a pipeline: where data is coming from a source, how you process it, and where it lands.

101 00:11:55.850 00:12:02.909 Awaish Kumar: Now, when it lands somewhere, there comes the data modeling, right? That was just a regular ETL pipeline,

102 00:12:03.100 00:12:03.650 Awaish Kumar: Yes.

103 00:12:03.650 00:12:16.749 Awaish Kumar: the thing we talked about. But when it is going to some data warehouse, there we need some data modeling, right? How you are doing the storage, how you are going to process it further.

104 00:12:17.610 00:12:23.820 Awaish Kumar: So now, like, let’s discuss about that. What kind of data modeling you used like.

105 00:12:25.440 00:12:26.679 Abhijith Thakur: What I’m trying to.

106 00:12:26.680 00:12:29.290 Awaish Kumar: Data modeling techniques you utilized.

107 00:12:29.620 00:12:34.099 Abhijith Thakur: Kind of data modeling. So the 1st thing:

108 00:12:34.550 00:12:56.384 Abhijith Thakur: the business requirement was to create enterprise logical data models, and we were using Erwin as a data modeling tool for designing these data models, but

109 00:12:57.640 00:12:58.690 Abhijith Thakur: like,

110 00:12:59.080 00:13:16.380 Abhijith Thakur: firstly, the pipeline: the raw data is coming from different CSV files and from the databases. We used to pull the data using MySQL, and then these

111 00:13:16.440 00:13:30.772 Abhijith Thakur: are formatted and dropped and, like, stored in AWS S3. And then, by using PySpark, we read the data, apply data transformations, data cleaning and stuff. And then we

112 00:13:31.910 00:13:34.219 Abhijith Thakur: output the transformed data.

113 00:13:34.910 00:13:41.560 Awaish Kumar: Yeah, like, are you still more want to hear about like enterprise? Logical data modeling is kind of a

114 00:13:42.120 00:13:47.940 Awaish Kumar: like like. I get it somewhere. But

115 00:13:48.876 00:13:51.210 Awaish Kumar: a logical view of your databases.

116 00:13:52.140 00:13:52.690 Abhijith Thakur: Right?

117 00:13:53.620 00:14:14.685 Awaish Kumar: Which is platform-independent, right? You don't care, at that moment, what kind of databases you are going to use, or what kind of warehouses you are going to use. You have designed a logical view of, like, what are the core entities, or how they are being joined, or, like, a kind of

118 00:14:15.270 00:14:16.080 Abhijith Thakur: Yes.

119 00:14:17.530 00:14:24.990 Awaish Kumar: Logical view of your entire system, along with what attributes there will be, or things like that.

120 00:14:25.280 00:14:33.869 Awaish Kumar: But, like the this design, what? What was that design. Like, I mean, like, there are different

121 00:14:34.070 00:14:42.820 Awaish Kumar: modeling techniques, like there is dimensional data modeling in data warehouses, there is data vault data modeling.

122 00:14:43.070 00:14:58.922 Abhijith Thakur: Yeah, I mean, we designed dimensional models using the star schema. Let's say, for example, we had some fact sales table with keys like a customer ID, product ID and date ID, and then

123 00:14:59.610 00:15:01.680 Abhijith Thakur: we had to

124 00:15:12.790 00:15:13.610 Abhijith Thakur: like.
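[The star schema the candidate describes can be sketched as follows. This is an illustrative sketch only; the table and column names are generic examples, not from the actual project.]

```sql
-- Dimension tables: one row per customer / product / date
CREATE TABLE dim_customer (
    customer_id   INT PRIMARY KEY,
    customer_name VARCHAR(100)
);

CREATE TABLE dim_product (
    product_id   INT PRIMARY KEY,
    product_name VARCHAR(100)
);

CREATE TABLE dim_date (
    date_id   INT PRIMARY KEY,
    full_date DATE
);

-- Fact table: one row per sale, referencing each dimension by its key
CREATE TABLE fact_sales (
    sale_id     BIGINT PRIMARY KEY,
    customer_id INT REFERENCES dim_customer (customer_id),
    product_id  INT REFERENCES dim_product (product_id),
    date_id     INT REFERENCES dim_date (date_id),
    amount      NUMERIC(12, 2)
);
```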

125 00:15:13.610 00:15:16.436 Awaish Kumar: Okay, have you used any tools like

126 00:15:17.120 00:15:18.680 Awaish Kumar: Dbt.

127 00:15:20.430 00:15:30.510 Abhijith Thakur: Yes, I did use dbt... like, I did not use dbt directly, but I do know about dbt.

128 00:15:32.360 00:15:35.860 Awaish Kumar: Okay. What is a seed in dbt?

129 00:15:36.370 00:15:38.199 Abhijith Thakur: A seed in dbt?

130 00:15:39.240 00:15:39.820 Awaish Kumar: Good.

131 00:15:41.120 00:15:44.210 Awaish Kumar: Like, the concept of seeds in dbt.

132 00:15:44.210 00:15:46.790 Abhijith Thakur: Seeds, so...

133 00:15:52.840 00:15:53.620 Awaish Kumar: Okay.

134 00:15:54.600 00:15:57.369 Abhijith Thakur: So in dbt, what are...

135 00:15:58.710 00:16:08.759 Awaish Kumar: Like, basically, you know, if I want to write some modular SQL,

136 00:16:09.180 00:16:12.930 Awaish Kumar: what should I do if I'm in a dbt project?

137 00:16:13.120 00:16:16.540 Awaish Kumar: I want to write a modular SQL query,

138 00:16:17.296 00:16:21.679 Awaish Kumar: because my current query is very, very messy and large.

139 00:16:21.810 00:16:22.860 Awaish Kumar: I want to,

140 00:16:23.130 00:16:40.379 Awaish Kumar: like, there is this concept of modularity in our programming languages, like in Python or in any other language. I want to utilize that concept of modularity to write my SQL query in a dbt project. So what should I be doing, basically?

141 00:16:40.780 00:16:44.050 Abhijith Thakur: Load the data into the data warehouse.

142 00:16:45.390 00:16:46.110 Awaish Kumar: Sorry.

143 00:16:46.875 00:16:47.460 Abhijith Thakur: Like.

144 00:16:47.460 00:16:49.369 Awaish Kumar: Data is already in the data warehouse.

145 00:16:49.550 00:16:50.380 Abhijith Thakur: Okay.

146 00:16:51.100 00:16:52.440 Awaish Kumar: Did I? The

147 00:16:52.570 00:17:01.570 Awaish Kumar: dbt works like this. How does dbt work? dbt does not connect to multiple systems at the same time;

148 00:17:01.700 00:17:09.379 Awaish Kumar: from a single project, you have to connect only to a single database, and that is mostly the data warehouse where our data is located.

149 00:17:10.040 00:17:23.080 Awaish Kumar: dbt is not an ingestion tool, so we are going to pair it with some ingestion tool. The data comes into a data warehouse; after it comes into the data warehouse, we are going to use dbt to further transform

150 00:17:23.250 00:17:29.100 Awaish Kumar: the data which already resides in the data warehouse, right? So it's more like...

151 00:17:29.740 00:17:32.090 Awaish Kumar: like, there are 2 paradigms, like...

152 00:17:32.460 00:17:39.519 Awaish Kumar: there are 2 paradigms of data processing, of data loading: ETL and ELT.

153 00:17:39.750 00:17:40.440 Abhijith Thakur: Yes.

154 00:17:41.070 00:17:45.760 Awaish Kumar: And normally, when people are using dbt, they utilize ELT.

155 00:17:47.400 00:17:56.010 Awaish Kumar: So now the data is already in the warehouse. We want to transform it. But while transforming, I wrote a query which has become really messy.

156 00:17:56.670 00:18:05.070 Awaish Kumar: Now I want to make it readable, like, modular, so people understand what it is doing and what the different parts of it are,

157 00:18:05.900 00:18:09.960 Awaish Kumar: so what dbt features can I use

158 00:18:10.120 00:18:11.580 Awaish Kumar: to do that?

159 00:18:11.580 00:18:16.710 Abhijith Thakur: dbt seed, to clean up and scale up managing the data set.

160 00:18:17.970 00:18:22.819 Awaish Kumar: Yeah, but you don't know about dbt seeds. So like, dbt seeds are not about cleanup.

161 00:18:23.020 00:18:25.110 Awaish Kumar: She’ll put.

162 00:18:25.530 00:18:33.890 Awaish Kumar: Yeah, I asked what dbt seeds are, but that was a different question. They both don't relate to each other.

163 00:18:34.390 00:18:37.040 Abhijith Thakur: I mean, like...

164 00:18:37.740 00:18:43.400 Abhijith Thakur: let's say, if I had a CSV file which is

165 00:18:43.660 00:18:46.920 Abhijith Thakur: placed in the directory, and...

166 00:18:46.920 00:18:51.060 Awaish Kumar: That was the past question. What I'm asking now

167 00:18:51.540 00:18:54.519 Awaish Kumar: is, it is not related to seeds anymore.

168 00:18:55.100 00:18:55.560 Abhijith Thakur: Yeah, but it's...

169 00:18:55.560 00:18:56.360 Awaish Kumar: One question.

170 00:18:57.920 00:19:11.630 Awaish Kumar: Now, we don't talk about a CSV file. I'm saying that data is already in a table in a data warehouse. I'm reading data from a table, I have to transform it, and writing back to a table. So there's no need of files here.
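[For reference, a dbt seed is a CSV file checked into the project's seeds/ directory; running `dbt seed` loads it into the warehouse as a table, which models can then reference with `ref()` like any other model. A minimal sketch; the file and column names are illustrative, not from the discussion.]

```sql
-- seeds/country_codes.csv (a CSV file in the dbt project, loaded with `dbt seed`):
--   country_code,country_name
--   US,United States
--   IN,India

-- models/customers_enriched.sql: reference the seed like any other model
SELECT
    c.customer_id,
    c.country_code,
    cc.country_name
FROM {{ ref('raw_customers') }} AS c
LEFT JOIN {{ ref('country_codes') }} AS cc
    ON c.country_code = cc.country_code
```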

171 00:19:13.570 00:19:24.400 Awaish Kumar: But my SQL query is written already. I'm just asking, how can I modularize it? What kind of features does dbt provide to modularize it?

172 00:19:26.394 00:19:29.129 Abhijith Thakur: What kind of features?

173 00:19:31.600 00:19:41.230 Awaish Kumar: Okay, so dbt has the concept of macros, right? So using macros, you can actually write different functions.

174 00:19:41.490 00:19:49.740 Awaish Kumar: And also you can use the concept of ephemeral models.

175 00:19:51.360 00:19:52.270 Awaish Kumar: So

176 00:19:53.920 00:20:06.860 Awaish Kumar: the, like, the materialization. So there are different materializations. You can use some materializations which basically provide you a kind of a function, where you wrote some code,

177 00:20:07.170 00:20:08.680 Awaish Kumar: and it will...

178 00:20:08.940 00:20:23.379 Awaish Kumar: It is in a separate file, but you can actually reference it and just use it in your new model. You don't have to rewrite it. And the concept of macros is kind of similar: when you have to, like, transform something, or do some

179 00:20:23.490 00:20:38.240 Awaish Kumar: calculation multiple times in a query, or something which is being used in multiple models, you can move it outside of your model.

180 00:20:38.390 00:20:41.430 Awaish Kumar: So like, we create utility functions, right,

181 00:20:41.640 00:20:48.889 Awaish Kumar: which can be used in different files. So it is similar to that: you can create a macro, which is kind of a function,

182 00:20:49.000 00:20:52.089 Awaish Kumar: and then you just reference it.

183 00:20:52.220 00:20:53.691 Awaish Kumar: It may be, like,

184 00:20:54.600 00:21:05.490 Awaish Kumar: a big function with 50 lines of SQL query, and you just reference it in your model. And now we have 50 lines less in our actual SQL,

185 00:21:05.890 00:21:13.630 Awaish Kumar: so things like that. You can basically modularize the code using that. So yeah, that's about dbt.
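[The macro idea the interviewer describes can be sketched like this; the macro and model names are illustrative, not from the interview.]

```sql
-- macros/revenue_bucket.sql: a reusable "utility function" for models
{% macro revenue_bucket(amount_col) %}
    CASE
        WHEN {{ amount_col }} >= 1000 THEN 'high'
        WHEN {{ amount_col }} >= 100  THEN 'medium'
        ELSE 'low'
    END
{% endmacro %}

-- models/orders_bucketed.sql: call the macro instead of repeating the CASE logic
SELECT
    order_id,
    amount,
    {{ revenue_bucket('amount') }} AS bucket
FROM {{ ref('stg_orders') }}
```

[The "separate file you reference without rewriting" part maps to dbt's ephemeral materialization: a model configured with `{{ config(materialized='ephemeral') }}` is not built as a table or view, but is inlined as a CTE wherever other models `ref()` it.]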

186 00:21:13.760 00:21:16.549 Awaish Kumar: Now we can talk more about

187 00:21:17.829 00:21:29.100 Awaish Kumar: SQL, SQL and data modeling. I will just keep it there, because that's where you have the most experience, as you mentioned.

188 00:21:29.900 00:21:34.939 Awaish Kumar: So first, SQL: how do you rate yourself out of 10?

189 00:21:37.440 00:21:40.070 Abhijith Thakur: 8.

190 00:21:41.680 00:21:49.630 Awaish Kumar: Okay? So for example, I have a

191 00:21:50.930 00:21:55.420 Awaish Kumar: for example, I have a very big table.

192 00:21:55.660 00:22:07.330 Awaish Kumar: You know, it's like millions of records, and I want to...

193 00:22:07.720 00:22:10.129 Awaish Kumar: I'm querying something from this table.

194 00:22:11.460 00:22:13.851 Awaish Kumar: Right now, it only has...

195 00:22:15.609 00:22:20.439 Awaish Kumar: like, for example, customer name,

196 00:22:20.600 00:22:24.190 Awaish Kumar: and, like, the events

197 00:22:24.370 00:22:26.790 Awaish Kumar: it is performing on my application.

198 00:22:26.950 00:22:32.930 Awaish Kumar: So customer name, event name. Event name is, like: performed a click, or

199 00:22:34.370 00:22:39.960 Awaish Kumar: page view, or whatever it is, through my application. And the timestamp.

200 00:22:40.410 00:22:49.669 Awaish Kumar: And, like, only 3 columns. But you know, the table has grown, because

201 00:22:49.780 00:22:55.480 Awaish Kumar: every customer I have, they are very active in my app, and they are performing regularly

202 00:22:55.590 00:23:06.939 Awaish Kumar: a lot of different events on my application, and we are capturing it into a single table, which is storing each person's different events. So there are maybe, like,

203 00:23:07.260 00:23:10.823 Awaish Kumar: 10,000 customers. But there are millions of events being

204 00:23:11.946 00:23:16.310 Awaish Kumar: performed every day, and they are coming to this table.

205 00:23:16.470 00:23:18.769 Awaish Kumar: So the table has grown a lot now,

206 00:23:19.302 00:23:24.930 Awaish Kumar: and when I query this table to see, like, okay,

207 00:23:25.700 00:23:29.110 Awaish Kumar: what events person X performed

208 00:23:29.450 00:23:36.459 Awaish Kumar: today, in some time frame, like between 1 to 3 PM

209 00:23:36.770 00:23:41.779 Awaish Kumar: today. How many events were performed by a given customer,

210 00:23:41.980 00:23:51.269 Awaish Kumar: something like that. When I write a SQL query, it takes a lot of time to actually query this table. So how can I optimize

211 00:23:52.042 00:24:03.609 Awaish Kumar: this table, or restructure this table, or anything you have in your mind, so that my query retrieval time becomes really fast?

212 00:24:06.270 00:24:11.509 Abhijith Thakur: To optimize, I think we can use stored procedures,

213 00:24:12.040 00:24:20.259 Abhijith Thakur: where we can reuse... it works as in functions, where we can go ahead and reuse it, and...

214 00:24:20.260 00:24:28.209 Awaish Kumar: Yeah, but that's just like a function. I'm not talking about that right now. I have a query which I'm running. I don't...

215 00:24:28.700 00:24:29.590 Awaish Kumar: okay,

216 00:24:30.300 00:24:38.649 Awaish Kumar: care about how it is being executed, like how it's written, if it executes from a stored procedure, or it executes

217 00:24:38.800 00:24:40.369 Awaish Kumar: just as a query.

218 00:24:40.580 00:24:47.390 Awaish Kumar: The execution time of the query will be the same as the execution time of a stored procedure.

219 00:24:47.620 00:24:54.489 Awaish Kumar: That's a lot, right? I just want to optimize that. So if a query is running, taking like 2 minutes, I just want

220 00:24:55.129 00:24:58.699 Awaish Kumar: it to, like, finish in 3 seconds.

221 00:24:58.810 00:25:00.220 Awaish Kumar: That’s my goal.

222 00:25:02.320 00:25:29.490 Abhijith Thakur: We can use partitioning, if the database supports it, like Postgres or Snowflake. We can partition the table by the timestamp, so that when you write a query for a specific time range, it only scans that partition. Or else we can go ahead with maybe usage of clustering,

223 00:25:30.490 00:25:38.040 Abhijith Thakur: like, to run the SQL query by, like...
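[The timestamp-partitioning idea the candidate mentions can be sketched in Postgres declarative partitioning; the table and column names follow the interviewer's example, the partition bounds are illustrative.]

```sql
-- Range-partition the events table by day on the event timestamp
CREATE TABLE events (
    customer_name TEXT,
    event_name    TEXT,
    event_ts      TIMESTAMP NOT NULL
) PARTITION BY RANGE (event_ts);

-- One child table (partition) per day
CREATE TABLE events_2025_07_28 PARTITION OF events
    FOR VALUES FROM ('2025-07-28') TO ('2025-07-29');

-- A query with a time-range filter only scans the matching partition(s),
-- not the whole table (partition pruning)
SELECT count(*)
FROM events
WHERE event_ts >= '2025-07-28 13:00'
  AND event_ts <  '2025-07-28 15:00'
  AND customer_name = 'some_customer';
```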

224 00:25:38.040 00:25:40.660 Awaish Kumar: How is clustering going to help us?

225 00:25:42.080 00:25:57.089 Abhijith Thakur: Clustering... so, like, generally when you run a SQL query, it takes like 2 minutes to run, and then you want to bring down that time to

226 00:25:57.300 00:26:04.460 Abhijith Thakur: 1 minute, or 2 seconds, or 10 seconds. And to do this, generally,

227 00:26:04.630 00:26:10.759 Abhijith Thakur: we can use the clustering. And how does this clustering work?

228 00:26:14.600 00:26:23.496 Abhijith Thakur: I can say, like, it helps to speed up the queries. But...

229 00:26:23.940 00:26:27.939 Awaish Kumar: What is this technique? How are we going to use it?

230 00:26:28.620 00:26:33.840 Awaish Kumar: How are we going to utilize it? That's the question, right? How is it going to help us?

231 00:26:36.610 00:27:00.570 Abhijith Thakur: Let's say, when you have a table with millions of records, and you often filter them by, let's say, different fields. Then the database can scan only a block instead of reading the entire data set, if you are frequently filtering the events which are specific to a customer.

232 00:27:01.600 00:27:03.420 Awaish Kumar: But what is clustering, right?

233 00:27:05.890 00:27:08.169 Abhijith Thakur: You want me to define, clustering like.

234 00:27:08.850 00:27:14.299 Awaish Kumar: I don't understand, like, how it is going to have these blocks. How...

235 00:27:14.440 00:27:24.009 Awaish Kumar: you mentioned the search is going to be only in some blocks, but how is that going to work? For example, you talked about partitioning.

236 00:27:24.840 00:27:27.370 Awaish Kumar: Let's see how partitioning works, for example.

237 00:27:31.140 00:27:38.399 Abhijith Thakur: Like, organizing the data based on the distinct values in specific columns.

238 00:27:40.880 00:27:43.420 Awaish Kumar: Okay. And then, like how we.

239 00:27:43.870 00:27:48.120 Awaish Kumar: how the insertions will work, and how the retrieval will work.

240 00:27:50.074 00:27:51.040 Abhijith Thakur: Insertions.

241 00:27:52.670 00:27:54.238 Abhijith Thakur: I can say that.

242 00:27:58.380 00:28:03.110 Awaish Kumar: But in Postgres, for example, if I created a partitioned table...

243 00:28:03.540 00:28:13.850 Awaish Kumar: So I said I have a table called events, right? And the events table is partitioned, right? As you said,

244 00:28:14.230 00:28:18.380 Awaish Kumar: so now, like, if I want to insert something in a,

245 00:28:18.890 00:28:40.850 Awaish Kumar: you know, a partition which doesn't exist yet in Postgres, what are you normally going to do? Like, I gave you data to load into a table, but the data contains events for, let's say, July 29th. But there's no partition for July 29th yet.

246 00:28:41.070 00:28:44.990 Awaish Kumar: So how are you going to proceed with the insertion?

247 00:28:46.160 00:28:47.710 Abhijith Thakur: Hmm, no.

248 00:28:53.660 00:28:57.107 Abhijith Thakur: For the partition, I can say that,

249 00:28:59.250 00:29:11.610 Abhijith Thakur: like, event-based partitioning based on the specific event date, and... or like,

250 00:29:12.300 00:29:30.870 Abhijith Thakur: the table's partition schema... divide these step by step, and then make sure your incoming data set has the valid columns, and then, if it doesn't have it, you'll need to extract or derive it, so that, you know...

251 00:29:30.870 00:29:34.479 Awaish Kumar: No, no. What I'm saying is this:

252 00:29:35.150 00:29:42.509 Awaish Kumar: we have an events table, which is a partitioned table. We have defined that column,

253 00:29:43.046 00:29:44.753 Awaish Kumar: event date, which is

254 00:29:45.680 00:29:57.390 Awaish Kumar: a datetime column, and we have defined the partition on top of it. So what I'm mentioning is that I gave you a file, a CSV file, to load

255 00:29:57.590 00:30:00.779 Awaish Kumar: a bunch of events into this table.

256 00:30:01.020 00:30:08.159 Awaish Kumar: So all those events, the list of events which I gave you, are basically all from July 29th.

257 00:30:08.370 00:30:18.790 Awaish Kumar: But the table where you have to insert all this does not have a partition for July 29th yet, because there was no entry for July 29th

258 00:30:19.590 00:30:25.469 Awaish Kumar: before this. So how are you going to create that partition? How do you then proceed with the insertion?
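[The question left open above, inserting rows for a day that has no partition yet, is typically handled in Postgres either by creating the missing partition before loading, or by attaching a DEFAULT partition that catches unmatched rows. A sketch; the table names follow the example in the discussion, and the CSV path is hypothetical.]

```sql
-- Option 1: create the missing daily partition first, then load the file
CREATE TABLE events_2025_07_29 PARTITION OF events
    FOR VALUES FROM ('2025-07-29') TO ('2025-07-30');

-- Hypothetical path: load the CSV into the parent table; Postgres routes
-- each row to the matching partition automatically
COPY events FROM '/tmp/events_2025_07_29.csv' WITH (FORMAT csv, HEADER);

-- Option 2: a DEFAULT partition catches rows that match no existing range,
-- so inserts never fail; rows can be moved to a proper partition later
CREATE TABLE events_default PARTITION OF events DEFAULT;
```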

259 00:32:04.800 00:32:05.430 Awaish Kumar: Hello!

260 00:33:18.810 00:33:19.790 Awaish Kumar: Hello!