Meeting Title: Data Engineer Interview (Jonathan Gomendoza) Date: 2025-07-16 Meeting participants: Awaish Kumar, jonathan g


WEBVTT

1 00:01:13.160 00:01:16.100 jonathan g: Hey, Awaish, good day.

2 00:01:20.990 00:01:22.780 Awaish Kumar: Hello! How are you doing.

3 00:01:23.530 00:01:28.559 jonathan g: I’m fine. Thank you. Thanks for asking. Yeah. I just wanted to ask, How are you.

4 00:01:29.720 00:01:31.600 Awaish Kumar: Yeah, I’m good as well.

5 00:01:32.780 00:01:39.460 jonathan g: Hey, Awaish, is it all right if we do this off cam? Because I have Internet issues on my side.

6 00:01:41.470 00:01:43.190 Awaish Kumar: Sorry. Sorry. Can you please come again.

7 00:01:43.190 00:01:49.230 jonathan g: Hello, okay, I will repeat my request. Is it all right if we do this off cam,

8 00:01:49.994 00:01:55.660 jonathan g: because I had Internet issues on my side, but we can. We can continue the call.

9 00:01:57.380 00:02:01.130 Awaish Kumar: Yeah, but I think it would be nice if we could have

10 00:02:01.630 00:02:06.459 Awaish Kumar: cameras open. But it’s like more like a conversation

11 00:02:07.108 00:02:10.219 Awaish Kumar: not like a test.

12 00:02:10.870 00:02:19.359 Awaish Kumar: If it was some kind of test, then I could just say, you know, work on that, I’m just waiting. We don’t need cameras for that.

13 00:02:19.780 00:02:25.159 Awaish Kumar: But this is like a two-way conversation where we are going to discuss

14 00:02:25.430 00:02:30.920 Awaish Kumar: Brainforge, your experiences, and how we can potentially collaborate.

15 00:02:31.180 00:02:37.570 Awaish Kumar: So I don’t have any issue with rescheduling if you have Internet issues right now.

16 00:02:38.222 00:02:40.877 Awaish Kumar: you can reschedule it at any

17 00:02:41.450 00:02:44.899 Awaish Kumar: time in my calendar. We can do it later.

18 00:02:46.340 00:02:46.950 jonathan g: Oh.

19 00:02:47.862 00:02:56.030 jonathan g: yeah, I think we could continue. But is it all right if I stay off cam on my side, or should I be on cam as well?

20 00:03:01.210 00:03:05.460 jonathan g: Because right now, yeah, internet issues, I.

21 00:03:05.460 00:03:06.030 Awaish Kumar: Yeah.

22 00:03:06.370 00:03:07.760 jonathan g: We could continue, continue.

23 00:03:08.890 00:03:21.230 Awaish Kumar: That’s what I’m saying, that this is a two-way conversation, and I would prefer that we have cameras on. And if that is not possible right now because you have Internet issues, you can reschedule it

24 00:03:23.000 00:03:24.230 Awaish Kumar: for a later time.

25 00:03:28.100 00:03:28.990 Awaish Kumar: Okay.

26 00:03:29.539 00:03:40.730 Awaish Kumar: that’s what I would advise. If you are not able to connect your camera, just reschedule it for any other time when you have a good Internet connection.

27 00:03:43.290 00:03:44.210 Awaish Kumar: Okay.

28 00:03:45.250 00:03:53.170 jonathan g: Okay, I will give. I will give my best effort on the data to configure my

29 00:03:53.370 00:03:55.390 jonathan g: cam first. You said alright.

30 00:03:56.300 00:03:58.069 jonathan g: I’ll just give my best effort.

31 00:03:58.340 00:03:59.740 jonathan g: Okay, I’ll just.

32 00:03:59.740 00:04:03.830 Awaish Kumar: No, my question is just, do you want to continue now?

33 00:04:03.830 00:04:04.200 Awaish Kumar: Yes.

34 00:04:05.090 00:04:05.810 jonathan g: We’ll proceed.

35 00:04:05.810 00:04:08.451 Awaish Kumar: Keep your cameras on, or

36 00:04:09.540 00:04:15.360 Awaish Kumar: we can do it at any time later. So, like, no pressure.

37 00:04:16.440 00:04:19.140 jonathan g: Sure. Thanks, thank you.

38 00:04:48.230 00:04:48.840 Awaish Kumar: Right.

39 00:04:51.510 00:04:53.400 jonathan g: Hey, Awaish! Can you see me?

40 00:04:54.220 00:04:58.613 Awaish Kumar: Yeah, I can hear you. Sorry. Just give me a moment

41 00:04:59.633 00:05:07.690 Awaish Kumar: yeah, so we can start. My name is Awaish Kumar. I’m the engineering manager at Brainforge. And

42 00:05:08.895 00:05:12.870 Awaish Kumar: today, in this interview, we are going to talk about

43 00:05:13.358 00:05:28.670 Awaish Kumar: what Brainforge does. And then we are going to know a little bit more about you. Your experience in the kind of projects you have been working on, and what is your contribution in your projects. We’ll deep dive into the project term

44 00:05:29.597 00:05:33.670 Awaish Kumar: architecture, and how you are building the pipelines and all.

45 00:05:33.820 00:05:51.470 Awaish Kumar: So yeah, let’s start. So my name is Awaish. What Brainforge does: Brainforge is a data and AI consultancy. We provide data services to different clients spanning across industries.

46 00:05:52.460 00:06:03.997 Awaish Kumar: Now we are also providing quite a few services, building AI agents and bots based on LLMs and, like, supporting different

47 00:06:05.910 00:06:27.129 Awaish Kumar: natural language processing kinds of work. For example, somebody wants to build a chatbot to book something, some hotels or something like that. So all of that is powered by AI. That’s the kind of services we are providing, along with all the data,

48 00:06:27.290 00:06:29.530 Awaish Kumar: all the data kind of work like

49 00:06:29.630 00:06:39.130 Awaish Kumar: data analysis, data engineering, data analytics, everything data. And we don’t have a

50 00:06:39.270 00:06:43.980 Awaish Kumar: specific set of tools which we work with. It is more based on the clients.

51 00:06:44.230 00:06:49.086 Awaish Kumar: So for each client and for each use case, we are going to define what kind of

52 00:06:49.620 00:07:16.190 Awaish Kumar: tools are going to be needed for this, and also what the client can approve, and based on that we use different tools and technologies. But this gives us an edge, because we are able to explore more and more tools, like the new tools which are coming to market. We explore them and use them for our new clients.

53 00:07:17.370 00:07:25.890 Awaish Kumar: So that’s basically what Brainforge is doing. It is a team of about 10 to 15 people, including the engineering and the

54 00:07:26.596 00:07:33.030 Awaish Kumar: sales marketing team. And yeah, that’s that’s everything about Brainforge.

55 00:07:33.250 00:07:36.390 Awaish Kumar: So now, like, if you can introduce yourself, please.

56 00:07:37.700 00:07:51.899 jonathan g: All right. Thanks for that introduction, Awaish. I’m Jonathan. I’ve been working in the IT industry for 11 years, going on 12, so I’ve been exposed to industries like

57 00:07:52.000 00:08:08.510 jonathan g: financial services and telco. There’s also retail, food and beverages, and government institutions. There are also conglomerates. Some are located in the Philippines, and some are located in

58 00:08:09.200 00:08:15.565 jonathan g: other parts of the world, like Europe. There’s also the US. And there’s also,

59 00:08:16.850 00:08:29.059 jonathan g: yeah, those are the areas I’ve been working with. Aside from that, I was also exposed to the healthcare industry and also the HR industry, the most recent.

60 00:08:29.390 00:08:48.739 jonathan g: Then for the technologies, I’ve been exposed to cloud vendors like AWS. I also have knowledge of Google Cloud Platform and Microsoft Azure. Aside from that, there’s also infrastructure as code, such as AWS CloudFormation. We know that this is owned by AWS.

61 00:08:48.880 00:08:51.789 jonathan g: Then there’s also open source, like Terraform.

62 00:08:52.290 00:09:01.290 jonathan g: Aside from that, there are also open-source tools like dbt, or data build tool, where you could do some

63 00:09:01.750 00:09:05.600 jonathan g: orchestration within the data warehouse or the database itself.

64 00:09:06.030 00:09:24.159 jonathan g: And also, speaking of databases, I also have experience in Structured Query Language, or SQL. So I had exposure as well to other database vendors like Microsoft SQL, MySQL, and PostgreSQL. Then for programming languages, I’ve been

65 00:09:24.370 00:09:33.629 jonathan g: exposed to Python and Node.js. Recently, I’ve been upskilling as well in Python, Snowflake, and Azure Fabric.

66 00:09:33.880 00:09:39.019 jonathan g: Then there’s also the

67 00:09:39.270 00:09:45.379 jonathan g: context of DevOps. There’s also continuous integration and continuous delivery, or CI/CD.

68 00:09:45.580 00:09:51.389 jonathan g: The tools I was exposed to there were GitHub Actions, Bitbucket Pipelines, and Azure DevOps.

69 00:09:51.630 00:09:55.510 jonathan g: There are also project management tools like Jira and Confluence.

70 00:09:55.640 00:09:58.990 jonathan g: Recently I was exposed to Linear and Notion.

71 00:09:59.540 00:10:11.669 jonathan g: Then, apart from that, there is also ETL or ELT. So ETL is extract, transform, and load. Recently I was also exposed to extract, load, and

72 00:10:11.770 00:10:12.540 jonathan g: transform.

73 00:10:15.540 00:10:25.329 Awaish Kumar: Okay, so talking about your recent experiences, can you give me an experience of a project where

74 00:10:26.360 00:10:31.510 Awaish Kumar: you optimized some existing data pipeline.

75 00:10:32.230 00:10:33.550 jonathan g: Alright.

76 00:10:33.950 00:10:45.309 jonathan g: right? Yeah, I can share a few. So for the existing data pipeline, I’ll just give you more context. So yes, there’s an existing data pipeline.

77 00:10:45.420 00:10:51.800 jonathan g: So the tech stack they were using was in AWS; to be specific, the service was Lambda,

78 00:10:51.910 00:11:01.000 jonathan g: using CloudFormation as the infrastructure as code. Then they were storing the data in Google Cloud Storage and BigQuery.

79 00:11:01.270 00:11:07.919 jonathan g: So it is good for the short term. But in the long term it’s not

80 00:11:08.070 00:11:11.670 jonathan g: feasible anymore, because, moving forward, the

81 00:11:11.840 00:11:15.529 jonathan g: data, which is first stored in an AWS RDS instance,

82 00:11:15.940 00:11:25.429 jonathan g: the live environment or the production environment, keeps on aggregating. So there is an increase in the data, in the row count.

83 00:11:25.750 00:11:35.370 jonathan g: So there is, like, a timeout when loading data to Google Cloud Storage. Though it could populate in BigQuery, it is not

84 00:11:35.940 00:11:39.520 jonathan g: feasible for the long term, so

85 00:11:39.920 00:11:48.060 jonathan g: I’ve done a proof of concept. So there were two services I considered. One is in AWS, using AWS Glue.

86 00:11:48.200 00:11:57.130 jonathan g: The other one is using Google Cloud Datastream, which is owned by Google Cloud Platform. Upon doing some research, which included as well

87 00:11:57.300 00:12:07.159 jonathan g: the costing and how it’s going to be used, and also for the long term, it seemed during my research that I had to use

88 00:12:07.320 00:12:11.379 jonathan g: Datastream for the data pipeline.

89 00:12:11.480 00:12:15.259 jonathan g: So the reason for using Datastream is that

90 00:12:15.360 00:12:22.669 jonathan g: there is the functionality of change data capture, whereby if there is a

91 00:12:22.850 00:12:27.750 jonathan g: change or an update in your source table or your schema,

92 00:12:27.960 00:12:31.740 jonathan g: then it will also be reflected, and it will process that

93 00:12:32.290 00:12:36.110 jonathan g: to load your data to Google Cloud Storage.

94 00:12:36.210 00:12:41.230 jonathan g: Then for BigQuery, you can just map or point it to

95 00:12:42.000 00:12:47.489 jonathan g: Google Cloud Storage. So that’s one of the pipeline enhancements I’ve done.

96 00:12:47.660 00:12:54.299 jonathan g: Another enhancement I’ve done was for a security problem.

97 00:12:54.430 00:12:56.540 jonathan g: So the problem is that

98 00:12:56.950 00:13:05.710 jonathan g: the password in the credentials was exposed. That is to say, it was stored in a

99 00:13:06.580 00:13:11.370 jonathan g: code script, that is, in JSON format, to be specific.

100 00:13:11.900 00:13:26.529 jonathan g: So is that the best practice when it comes to storing credentials? I don’t think that’s a best practice, because the security team will audit that

101 00:13:26.790 00:13:33.329 jonathan g: pipeline or that repository. They will just call you out and say this is not the best practice for storing your

102 00:13:33.840 00:13:35.070 jonathan g: password

103 00:13:35.180 00:13:48.399 jonathan g: at the code level. It doesn’t matter whether that is a project-owned repository or a public repository, because we do not want to compromise the data of the client.

104 00:13:48.610 00:13:59.560 jonathan g: So I suggested, how about we use AWS Secrets Manager, which is a secrets manager owned by AWS? So that’s one.

105 00:13:59.680 00:14:04.479 jonathan g: There’s also GitHub Secrets, which is owned by GitHub.

106 00:14:04.680 00:14:12.150 jonathan g: Another option is, how about we use Google Cloud Platform Secret Manager, which is owned by GCP?

107 00:14:12.690 00:14:24.089 jonathan g: So all options are taken into consideration as well. It varies: if you are going to use AWS services, then utilize

108 00:14:24.200 00:14:28.329 jonathan g: AWS Secrets Manager. If we are going to utilize

109 00:14:28.920 00:14:32.169 jonathan g: GCP services, then we use GCP secrets.

110 00:14:32.370 00:14:37.629 jonathan g: If this is, like, open source, let us utilize GitHub Secrets.

111 00:14:37.870 00:14:41.370 jonathan g: So there are options to take into consideration.
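
The point about keeping credentials out of a JSON file in the repository can be illustrated with a minimal sketch. It assumes the secret store (AWS Secrets Manager, GCP Secret Manager, or GitHub Secrets, as named in the conversation) injects the values into the process environment at deploy time; the variable names `DB_USER` and `DB_PASSWORD` and the helper `load_db_credentials` are hypothetical and not taken from the project being discussed.

```python
import os


def load_db_credentials(env=os.environ):
    """Read database credentials from the environment, not from a file in the repo.

    DB_USER and DB_PASSWORD are hypothetical names; a secrets manager or the
    CI system is assumed to populate them before the pipeline starts.
    """
    required = ("DB_USER", "DB_PASSWORD")
    missing = [key for key in required if key not in env]
    if missing:
        # Fail loudly instead of falling back to a hard-coded password.
        raise RuntimeError(f"missing required secrets: {', '.join(missing)}")
    return {"user": env["DB_USER"], "password": env["DB_PASSWORD"]}
```

With this shape, nothing sensitive is committed: the repository only holds the names of the secrets, and an audit of the code finds no password.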

112 00:14:42.880 00:14:46.659 jonathan g: Then, aside from that one, there’s also

113 00:14:47.040 00:14:51.799 jonathan g: one of my team members doing manual deployment. So the manual.

114 00:14:51.990 00:14:54.260 Awaish Kumar: For example, I have a table

115 00:14:54.480 00:14:58.480 Awaish Kumar: with billions of billions of like rows.

116 00:14:59.070 00:15:03.419 Awaish Kumar: and I want to run some queries on top of it.

117 00:15:03.870 00:15:07.470 Awaish Kumar: And my query is already very slow.

118 00:15:09.750 00:15:12.189 Awaish Kumar: because there’s a lot of data. Obviously.

119 00:15:12.480 00:15:16.759 Awaish Kumar: I want to identify the user with the

120 00:15:17.687 00:15:25.230 Awaish Kumar: I want to search for some users. And the table has, like, a first name,

121 00:15:26.456 00:15:38.319 Awaish Kumar: email, phone number, address, and a few more fields such as, you know, for example,

122 00:15:39.530 00:15:42.890 Awaish Kumar: region or whatever like, etc. So

123 00:15:43.740 00:15:50.959 Awaish Kumar: now I want to query for a user. I want to search for a user. I want to get the address, for example.

124 00:15:51.100 00:15:58.180 Awaish Kumar: So this query is taking more than, like, a minute to execute.

125 00:15:58.460 00:16:01.220 Awaish Kumar: So how can I optimize?

126 00:16:01.720 00:16:02.640 Awaish Kumar: Wow!

127 00:16:02.930 00:16:08.560 Awaish Kumar: What steps can I take to optimize, to reduce this query time?

128 00:16:10.350 00:16:16.229 jonathan g: Oh, what I understand about the problem is that the query takes a long time. Is that correct?

129 00:16:16.520 00:16:23.150 jonathan g: So what I will be doing, as a solution for that one, is that

130 00:16:23.450 00:16:35.330 jonathan g: yes, you need to assess which columns are needed for your investigation or for your report. That will depend on your client requirements.

131 00:16:35.650 00:16:40.459 jonathan g: So if it says that all columns should be included,

132 00:16:40.640 00:16:47.180 jonathan g: then you need to assess as well what filters are needed for the business requirements.

133 00:16:47.400 00:16:55.239 jonathan g: So we will check there, like status or tagging, for the

134 00:16:55.690 00:16:58.269 jonathan g: filters or the values that are needed.

135 00:16:58.380 00:17:04.269 jonathan g: Another option is that we can use a common table expression, or CTE.

136 00:17:04.410 00:17:11.320 jonathan g: So you start by getting all the columns. Then from there you could add some filters from the start.

137 00:17:11.740 00:17:12.789 jonathan g: After doing that.

138 00:17:13.534 00:17:19.599 Awaish Kumar: So even if I read in a CTE, that’s going to read the full table.

139 00:17:20.010 00:17:20.470 jonathan g: Hmm.

140 00:17:21.380 00:17:23.210 Awaish Kumar: So that’s like, that’s

141 00:17:23.560 00:17:52.410 Awaish Kumar: that’s the problem, right? If I read the full table, that’s why it takes longer. If I query it, like, I want to search for myself by my name, and there are billions of rows in that. I’m not looking to optimize for multiple queries, like, okay, we load this data into memory, and after that every search will be faster.

142 00:17:53.558 00:18:06.129 Awaish Kumar: That’s not the situation here. That’s a different problem, where we want to optimize for multiple similar requests. So

143 00:18:06.510 00:18:09.069 Awaish Kumar: search by name

144 00:18:09.430 00:18:22.950 Awaish Kumar: for 10 employees. So we load the data into memory and then search in memory instead of directly querying the database. That’s one of the solutions, like, with a CTE you want to do that. But my question is more like,

145 00:18:23.300 00:18:28.160 Awaish Kumar: how can I structure my table itself so I can reduce my query time?

146 00:18:30.700 00:18:35.990 jonathan g: I think it will depend on the filter based on the tagging. So, for example, if you are.

147 00:18:35.990 00:18:41.700 Awaish Kumar: So my requirement, I’ve told you, is that I want to query for a user

148 00:18:41.930 00:18:49.590 Awaish Kumar: based on its number. For example, name or whatever. Right? So

149 00:18:51.640 00:19:01.120 Awaish Kumar: for example, say I have a table, like an events table. You can say it’s a kind of events table. For example, if I

150 00:19:02.060 00:19:08.630 Awaish Kumar: make a full example, I have a mobile application. For example, my mobile application has 1,000

151 00:19:09.222 00:19:15.230 Awaish Kumar: users, and out of those 1,000 users. There they are making some

152 00:19:15.992 00:19:20.920 Awaish Kumar: click, some something, some button button click, some page views.

153 00:19:21.110 00:19:35.609 Awaish Kumar: So there are different activities happening on the app by all of these 1,000 users, and it builds up a table in the back end. I just have one table which is storing all these events

154 00:19:36.150 00:19:40.080 Awaish Kumar: right? And that one table is storing all these events

155 00:19:40.669 00:19:53.440 Awaish Kumar: by 1,000 users, and it has grown so much that now it has, like, millions and millions of rows. Now, when I want to search for myself by my name

156 00:19:53.590 00:19:57.779 Awaish Kumar: in that table, that’s basically the problem that

157 00:19:58.310 00:20:02.490 Awaish Kumar: now I’m not just searching through 1,000 rows.

158 00:20:02.630 00:20:12.110 Awaish Kumar: I have a table which has grown because of event activities happening, and I’m adding new rows to the table. Now I have millions of rows. I want to search for rows

159 00:20:12.260 00:20:18.570 Awaish Kumar: for Awaish Kumar, and now, when I do that, it is taking more than 1 min to execute

160 00:20:19.130 00:20:20.470 Awaish Kumar: for a single query.

161 00:20:20.590 00:20:25.770 Awaish Kumar: How can I? Maybe you can propose restructuring of the table. You can propose?

162 00:20:26.979 00:20:32.250 Awaish Kumar: any kind of optimization strategies, anything we can do in this situation.

163 00:20:34.670 00:20:42.599 jonathan g: Well, judging from your explanation, some are going to use indexing, the index function. Some would use that.

164 00:20:42.820 00:20:52.349 jonathan g: But yeah, that’s good for the performance, but for the long term it’s not good for utilization. That’s one, using an index.

165 00:20:52.680 00:21:01.620 jonathan g: You’re going to index your name. Another thing is that you need to have, like, a WHERE statement so that only your name will be filtered.

166 00:21:01.940 00:21:03.960 jonathan g: Also, another thing is that,

167 00:21:04.550 00:21:18.630 jonathan g: do you need all the columns? Because if you need all the columns, then, as expected, that will cost runtime, unless you will be eliminating some of the

168 00:21:19.450 00:21:25.950 jonathan g: columns that are not relevant. Then I would say that will also speed up the efficiency.

169 00:21:26.060 00:21:33.030 jonathan g: Other options: if the company or the client

170 00:21:33.480 00:21:41.379 jonathan g: has a good budget or a high budget, then there will be times that we could increase the memory by some gigabytes.

171 00:21:41.520 00:21:54.000 jonathan g: But I don’t think that will be considered, since it needs approval from the higher-ups when it comes to increasing memory, because that will also increase cost.

172 00:21:54.450 00:21:59.450 Awaish Kumar: Yeah, we are not looking to increase memory. I just want to optimize,

173 00:21:59.620 00:22:05.390 Awaish Kumar: like, I just want to work with SQL. So

174 00:22:05.570 00:22:16.129 Awaish Kumar: how would you architect your database, and how can you employ some optimization techniques provided by SQL? Everything else you can leave out,

175 00:22:16.390 00:22:23.009 Awaish Kumar: all right? I’m just asking in the context of SQL. So we can leave everything else on the side.

176 00:22:26.000 00:22:30.220 Awaish Kumar: So you mentioned indexing. What else can we do?

177 00:22:31.310 00:22:37.179 jonathan g: Indexing, that’s one. You also need to use a WHERE statement.

178 00:22:37.370 00:22:43.539 jonathan g: Then you need to utilize the LIKE function, if I may. Some would use the LIKE function

179 00:22:43.830 00:22:45.279 jonathan g: or the LIKE percent.

180 00:22:45.380 00:22:47.830 jonathan g: There’s also some as well.

181 00:22:48.270 00:22:49.460 jonathan g: You need your phone.

182 00:22:49.460 00:22:55.989 Awaish Kumar: But how are functions going to help with the optimization, with the query execution time?

183 00:22:56.990 00:23:02.760 jonathan g: You need to check on your columns as well. What are the columns that are needed for your relevant investigation?

184 00:23:02.760 00:23:09.590 Awaish Kumar: Oh, like, yeah, I’m getting that. So my query just selects only the columns needed.

185 00:23:09.930 00:23:21.429 Awaish Kumar: I’m not reading from all the columns. I need, I use the name of the user and the address, searching on the name of the user. That’s all I’m doing. I’m not selecting the

186 00:23:22.740 00:23:26.280 Awaish Kumar: the extra columns which are present in the table.

187 00:23:27.860 00:23:29.890 Awaish Kumar: but still it takes that time.

188 00:23:34.430 00:23:40.569 jonathan g: Hmm! Another thing is that what I can do is that I will use the name.

189 00:23:40.730 00:23:53.609 jonathan g: Then I will do a row count, so I will check how many rows it contains. It doesn’t matter if that is Awaish, Jonathan, or

190 00:23:53.780 00:24:05.559 jonathan g: John. Yeah, I will check as well on how many, because, as expected, let’s say, if it reaches 500, then we expect that

191 00:24:05.900 00:24:08.379 jonathan g: it should be running fast.

192 00:24:08.490 00:24:14.960 jonathan g: But if it reaches around 1 million, we’re already expecting that. Yeah, there are a lot of

193 00:24:15.490 00:24:20.960 jonathan g: transactions being done by John, Awaish, or Jonathan, based on the filters.

194 00:24:21.100 00:24:27.839 Awaish Kumar: Okay, yeah. But like, this is, this is more like auditing the data that

195 00:24:28.050 00:24:31.400 Awaish Kumar: how many rows we have for each person.

196 00:24:32.090 00:24:34.850 Awaish Kumar: But what about?

197 00:24:35.130 00:24:36.060 Awaish Kumar: Oh.

198 00:24:37.248 00:24:43.100 Awaish Kumar: like, this is the first step: you audit the data. You got the number of rows. But I

199 00:24:43.210 00:24:44.897 Awaish Kumar: I’m saying that

200 00:24:46.940 00:24:53.449 Awaish Kumar: even if that is reasonable, this is taking way longer than I would expect.

201 00:24:53.750 00:24:55.480 Awaish Kumar: I want to optimize it.

202 00:24:58.170 00:25:00.670 Awaish Kumar: So what about if we

203 00:25:01.600 00:25:07.739 Awaish Kumar: think about restructuring the table, like, what if we split the table into two?

204 00:25:10.540 00:25:11.900 jonathan g: More like partitioning.

205 00:25:13.140 00:25:18.949 Awaish Kumar: Yeah, that’s number one. Partitioning can be one of the strategies.

206 00:25:19.280 00:25:22.649 Awaish Kumar: Yeah. Number 2 could be that I told you

207 00:25:22.820 00:25:25.700 Awaish Kumar: that I’m searching for a person’s name.

208 00:25:25.860 00:25:30.830 Awaish Kumar: so the name is a constant, like, I’m making 1 million events.

209 00:25:30.930 00:25:37.269 Awaish Kumar: So the name Awaish Kumar is redundant in all those 1 million rows. What if I create another table

210 00:25:37.928 00:25:48.929 Awaish Kumar: with string values, names and ids? And instead of searching for the name in the long, big table, I search based on an integer.

211 00:25:50.870 00:25:54.119 Awaish Kumar: So searching on an integer is faster than searching on a string, right?

212 00:25:57.020 00:26:00.170 jonathan g: Well, we can consider that one. We could, like.

213 00:26:00.170 00:26:06.599 Awaish Kumar: One of the strategies is that, like, if I search on a string, it’s very slow

214 00:26:07.040 00:26:09.940 Awaish Kumar: when you compare it with searching for an integer,

215 00:26:10.660 00:26:16.019 Awaish Kumar: and that’s very fast. And we know that we only have 1,000 names,

216 00:26:16.360 00:26:19.500 Awaish Kumar: so we can put them in a different table,

217 00:26:19.950 00:26:24.059 Awaish Kumar: and each will get an id based on an auto-increment integer.

218 00:26:24.410 00:26:34.010 Awaish Kumar: Use that in the larger table. So now you can add indexing, the strategy you mentioned, on the integer column.

219 00:26:34.740 00:26:37.320 Awaish Kumar: So indexing that integer column

220 00:26:37.820 00:26:44.300 Awaish Kumar: and then searching on top of it will be really, really fast compared with searching on the string itself.
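
The restructuring described in this exchange (move the repeated names into a small lookup table, store an integer id in the large events table, and index that integer column) can be sketched as follows. This is only an illustration using Python's built-in `sqlite3` module with an in-memory database; the table names, column names, and the `events_for` helper are hypothetical, and the real system in the discussion may use a different database entirely.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Small lookup table: one row per user, name stored once.
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    -- Large events table: refers to the user by integer id only.
    CREATE TABLE events (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id),
        event_type TEXT NOT NULL
    );
    -- Index on the integer column used for searching.
    CREATE INDEX idx_events_user_id ON events(user_id);
""")
conn.executemany("INSERT INTO users (name) VALUES (?)",
                 [("Awaish Kumar",), ("Jonathan",)])
conn.executemany("INSERT INTO events (user_id, event_type) VALUES (?, ?)",
                 [(1, "page_view"), (2, "button_click"), (1, "button_click")])


def events_for(name):
    """Resolve the name to an id once, then search events by the indexed id."""
    row = conn.execute("SELECT id FROM users WHERE name = ?", (name,)).fetchone()
    if row is None:
        return []
    cursor = conn.execute(
        "SELECT event_type FROM events WHERE user_id = ? ORDER BY id", (row[0],))
    return [event_type for (event_type,) in cursor]


# Ask SQLite how it would run the events query; the plan should name the index.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT event_type FROM events WHERE user_id = ?", (1,)
).fetchall()
plan_text = " ".join(str(step[-1]) for step in plan)
```

The string comparison now happens only against the small `users` table, while the large table is searched through the integer index, which is the effect the interviewer is describing.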

221 00:26:44.420 00:26:47.290 Awaish Kumar: So this is like, kind of a

222 00:26:48.260 00:26:54.799 Awaish Kumar: how you can, like, structure it. But yeah, we can move ahead. You mentioned indexing.

223 00:26:54.930 00:27:00.179 Awaish Kumar: So what are the different types of indexing, and

224 00:27:01.870 00:27:03.999 Awaish Kumar: like, what are their pros and cons?

225 00:27:05.850 00:27:12.760 jonathan g: So the pros and cons for indexing: yes, the pro is that it could speed up your query.

226 00:27:12.990 00:27:14.430 jonathan g: The con is that

227 00:27:14.590 00:27:27.539 jonathan g: there’s going to be a problem moving forward when using the UPDATE, DELETE, and INSERT statements, especially if you apply the index there. That will also affect the performance and the utilization.

228 00:27:28.990 00:27:33.600 jonathan g: that I would say, that’s the most simplified version for.

229 00:27:34.920 00:27:40.779 Awaish Kumar: So, like, for example, how does indexing work in databases?

230 00:27:43.190 00:27:49.469 jonathan g: From what I understand about indexing is that you could. It seems like you could do some.

231 00:27:49.910 00:27:52.080 jonathan g: You you query the table itself.

232 00:27:52.290 00:27:58.499 jonathan g: Then you’re trying to look some information or specific column. You’re trying to do some index.

233 00:27:58.670 00:28:02.550 jonathan g: Then, after that one, it was already.

234 00:28:02.550 00:28:05.780 Awaish Kumar: Like how database implements indexing right.

235 00:28:06.320 00:28:17.189 Awaish Kumar: For example, I have a column. I want to add indexing on that. I added a default index on a column in a Postgres table,

236 00:28:18.050 00:28:24.359 Awaish Kumar: but how is indexing actually working? Like, how is it actually making the searching fast?

237 00:28:26.720 00:28:31.389 jonathan g: You need to type it in the, in your.

238 00:28:31.390 00:28:33.070 Awaish Kumar: Implements indexing in the back end.

239 00:28:35.760 00:28:48.829 jonathan g: And for that one, from what I understand about the behavior of an index, especially in the back end, it needs to check on a specific column.

240 00:28:49.010 00:28:52.959 jonathan g: For example, use Id as your index. So from there

241 00:28:54.620 00:28:56.610 jonathan g: kind of row, row, reading row.

242 00:28:56.610 00:28:59.480 Awaish Kumar: Using some data structure right.

243 00:29:01.620 00:29:04.730 jonathan g: Yes, they do use data structure as well. But.

244 00:29:04.730 00:29:10.549 Awaish Kumar: That’s what I want to understand. What data structure do they use, and how does it work?

245 00:29:17.230 00:29:19.009 jonathan g: Think for that one

246 00:29:21.130 00:29:24.580 jonathan g: I need to review on that one. Can I get back to you on that one.
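
For context on the question asked here: PostgreSQL's default index type is a B-tree, a balanced tree that keeps keys in sorted order so a lookup can skip most of the data instead of scanning every row. The sketch below is not a B-tree. It is a simplified stand-in, assuming a sorted list of (key, row position) pairs searched with binary search, which shows the same core idea; the sample rows and the `lookup` helper are hypothetical.

```python
import bisect

# Unordered "table": (user_id, event) rows in insertion order.
rows = [(17, "page_view"), (3, "click"), (42, "login"), (3, "scroll")]

# The "index": (key, row position) pairs kept sorted by key, so a key can be
# found by binary search instead of reading every row.
index = sorted((user_id, pos) for pos, (user_id, _) in enumerate(rows))


def lookup(user_id):
    """Return all rows for user_id using the sorted index."""
    # (user_id, -1) sorts before every real entry for that key.
    i = bisect.bisect_left(index, (user_id, -1))
    matches = []
    while i < len(index) and index[i][0] == user_id:
        matches.append(rows[index[i][1]])
        i += 1
    return matches
```

The trade-off the candidate mentions also shows up in this sketch: every insert, update, or delete on `rows` would need a matching change to `index`, which is the write overhead an index adds.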

247 00:29:24.990 00:29:30.840 Awaish Kumar: Okay. And what is the difference between clustered and non-clustered indexing?

248 00:29:32.200 00:29:35.519 jonathan g: For a clustered index, it’s more like it is

249 00:29:35.930 00:29:40.289 jonathan g: consolidated in one area. So there’s, like, a cluster,

250 00:29:40.570 00:29:43.590 jonathan g: whereas for non-clustered it is spread out,

251 00:29:43.760 00:29:57.149 jonathan g: a spread-out index. So, for example, for clustered, you are using one table to do your index, whereas for a non-clustered index, you have multiple tables where you could do some indexes.

252 00:29:58.960 00:29:59.900 Awaish Kumar: Okay.

253 00:30:00.370 00:30:01.350 jonathan g: So.

254 00:30:01.810 00:30:05.570 Awaish Kumar: Hmm, what about?

255 00:30:10.120 00:30:17.099 Awaish Kumar: Okay, we can move ahead with a different set of questions. So what is the difference between ACID and BASE,

256 00:30:17.810 00:30:20.079 Awaish Kumar: two different types of systems?

257 00:30:22.370 00:30:24.980 jonathan g: Or can you repeat the question? Acid language.

258 00:30:26.060 00:30:34.980 Awaish Kumar: So, ACID, like, we know, we say that there are some transactional systems, and there are some analytical systems,

259 00:30:35.160 00:30:38.597 Awaish Kumar: right so, and both of them have

260 00:30:40.350 00:30:48.200 Awaish Kumar: different properties to satisfy. So for a transactional system, we have ACID properties, right?

261 00:30:48.370 00:30:54.709 Awaish Kumar: And then there are the analytical systems, which basically work on the properties of BASE.

262 00:30:55.210 00:30:58.939 Awaish Kumar: Can you elaborate more on ACID versus BASE?
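
As background for this question, ACID stands for atomicity, consistency, isolation, and durability. The sketch below illustrates only the first property, atomicity, using Python's built-in `sqlite3` module: a transfer is two updates, and if the second one fails, the first is rolled back too. The `accounts` table, the `transfer` helper, and the balances are hypothetical and exist only for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 0)])
conn.commit()


def transfer(src, dst, amount):
    """Move amount from src to dst; either both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst))
            # Violates the CHECK constraint if src cannot cover the amount.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False


def balances():
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

When the debit fails, the credit that already ran inside the same transaction is undone, so no partial transfer is ever visible.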

263 00:31:00.480 00:31:16.620 jonathan g: For ACID, yeah, you mentioned transactions. This is appropriate for transactional data. So more like big data, there’s no need. So even though there’s, like, a complex join condition included, there will be.

264 00:31:16.760 00:31:29.410 jonathan g: That’s the transaction data that’s going to be used. Whereas for BASE, this is more like the summary data or the aggregated data. So there’s also

265 00:31:29.690 00:31:33.710 jonathan g: joining. Join conditions are included as well, the complex ones.

266 00:31:33.910 00:31:40.540 jonathan g: This can be used for reports, dashboards, or visualization. For the

267 00:31:41.110 00:31:46.250 jonathan g: other one, that’s the transactional data, which is more like, simply, like a

268 00:31:46.390 00:31:53.099 jonathan g: raw table or, like, a source table that can be used by other team members

269 00:31:53.560 00:31:56.489 jonathan g: when trying to what I’m going.

270 00:31:56.490 00:31:58.709 Awaish Kumar: What do ACID and BASE stand for?

271 00:32:00.110 00:32:02.169 Awaish Kumar: ACID is an acronym. What does it.

272 00:32:02.170 00:32:02.540 jonathan g: Students

273 00:32:06.800 00:32:10.799 jonathan g: Well, for ACID, from what I, for letter C, this is CRUD.

274 00:32:10.920 00:32:13.559 jonathan g: I mean, sorry, it’s not CRUD. It’s create.

275 00:32:13.890 00:32:18.850 jonathan g: I is for insert, D is for delete, A is for append.

276 00:32:20.460 00:32:21.750 Awaish Kumar: Sorry, A is for what?

277 00:32:22.130 00:32:22.820 jonathan g: Append.

278 00:32:24.530 00:32:25.660 Awaish Kumar: Append.

279 00:32:26.420 00:32:26.910 jonathan g: Append.

280 00:32:27.360 00:32:32.830 Awaish Kumar: Okay? And so, and I is for.

281 00:32:34.440 00:32:35.320 jonathan g: Insert.

282 00:32:37.060 00:32:42.880 Awaish Kumar: Okay? So I stands for isolation. So isolation means.

283 00:32:43.640 00:32:47.409 Awaish Kumar: Okay, like, I can ask you more, like, do you

284 00:32:47.810 00:32:52.460 Awaish Kumar: know what isolation is in the context of SQL,

285 00:32:52.990 00:32:55.990 Awaish Kumar: and what are the different types of isolation levels?

286 00:32:57.600 00:33:06.270 jonathan g: Well, I’ll give it a shot answering this isolation question, but I haven’t encountered it in my

287 00:33:06.720 00:33:20.919 jonathan g: work experience when it comes to isolation. But okay, so for isolation, it’s more of a you are isolating a data error. That’s how I understand when it comes to isolation. So it’s more of a

288 00:33:21.110 00:33:23.460 jonathan g: is there like a data quality issue.

289 00:33:23.760 00:33:29.569 jonathan g: or say, formatting issue? Then from there you need to validate from your source.

290 00:33:29.980 00:33:31.790 jonathan g: That’s how I understand about it.

291 00:33:32.950 00:33:37.139 Awaish Kumar: And have you used tools like Airflow?

292 00:33:39.580 00:33:44.500 jonathan g: Airflow. I heard about Airflow, this is more like an orchestration tool, open source.

293 00:33:44.500 00:33:46.479 Awaish Kumar: What orchestration tools have you used?

294 00:33:46.950 00:33:51.389 jonathan g: dbt, data build tool. Then there’s also Control-M,

295 00:33:51.520 00:33:55.479 jonathan g: and also for AWS, there is Step Functions.

296 00:33:56.420 00:33:57.150 Awaish Kumar: Okay?

297 00:33:58.640 00:34:03.779 Awaish Kumar: So in dbt, like, what kind of different features have you used?

298 00:34:05.450 00:34:13.330 jonathan g: In dbt, yeah, I do code refactoring as well, like you transform it into common table expressions.

299 00:34:13.440 00:34:22.390 jonathan g: There’s also the YAML file. So you need to add the project. For example, you’re going to add the dataset. You’re going to add the

300 00:34:22.610 00:34:25.919 jonathan g: where it’s going to be stored, or which folder it’s going to be stored in.

301 00:34:25.920 00:34:29.460 Awaish Kumar: But I mean, what is seed in dbt?

302 00:34:31.760 00:34:32.409 jonathan g: In.

303 00:34:33.150 00:34:34.869 Awaish Kumar: What is seed? Seed.

304 00:34:37.170 00:34:41.149 jonathan g: Oh, for seed, I haven’t encountered seed yet.

305 00:34:41.803 00:34:44.070 Awaish Kumar: Have you used macros?

306 00:34:45.370 00:34:51.969 jonathan g: Macros. Yes, the most common for macros is using ref and source.

307 00:34:53.900 00:34:56.700 Awaish Kumar: Okay. But have you implemented macros?

308 00:34:58.159 00:34:59.169 jonathan g: The custom ones?

309 00:34:59.170 00:34:59.760 Awaish Kumar: Thanks.

310 00:35:00.290 00:35:02.750 jonathan g: Not yet. Haven’t done a custom macro yet.

311 00:35:03.210 00:35:15.219 Awaish Kumar: Okay. And for dbt, like, what are some strategies for incremental data loading?

312 00:35:18.870 00:35:20.390 jonathan g: Can you repeat the question?

313 00:35:21.455 00:35:27.260 Awaish Kumar: In dbt, for incremental data loading, right,

314 00:35:27.420 00:35:31.319 Awaish Kumar: what are the different strategies that you can use in dbt?

315 00:35:32.990 00:35:39.880 jonathan g: For incremental load, normally you need to check if there is a duplicate when it comes to loading stuff.

316 00:35:40.000 00:35:45.950 jonathan g: So from there you need to apply the ROW_NUMBER option in SQL,

317 00:35:46.440 00:35:49.779 jonathan g: so that it will only get the latest data.
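
The deduplication step described in this answer (number the rows per key, newest first, and keep only the first) can be sketched as follows. This is a minimal illustration in Python against an in-memory SQLite database, not dbt or BigQuery. The table `raw_orders` and its columns `order_id`, `status`, and `updated_at` are hypothetical names chosen for the example, and it assumes an SQLite build with window-function support (3.25 or later).

```python
import sqlite3


def build_demo_connection():
    """Create an in-memory database with a hypothetical raw_orders table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE raw_orders (order_id INTEGER, status TEXT, updated_at TEXT)"
    )
    conn.executemany(
        "INSERT INTO raw_orders VALUES (?, ?, ?)",
        [
            (1, "created", "2025-07-01"),
            (1, "shipped", "2025-07-03"),  # newer version of order 1
            (2, "created", "2025-07-02"),
        ],
    )
    return conn


def latest_rows(conn):
    """Return one row per order_id, keeping the most recent updated_at."""
    query = """
        SELECT order_id, status, updated_at
        FROM (
            SELECT
                order_id,
                status,
                updated_at,
                -- rn = 1 marks the newest row within each order_id
                ROW_NUMBER() OVER (
                    PARTITION BY order_id
                    ORDER BY updated_at DESC
                ) AS rn
            FROM raw_orders
        )
        WHERE rn = 1
        ORDER BY order_id
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    print(latest_rows(build_demo_connection()))
```

The sketch covers only the "keep the latest row" step; how new rows are selected and merged into an existing table is a separate concern that it does not address.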

318 00:35:52.150 00:35:52.960 Awaish Kumar: Okay,

319 00:35:54.570 00:35:57.690 Awaish Kumar: So do you know the concept of hooks in dbt?

320 00:36:01.150 00:36:02.600 jonathan g: Can you repeat that?

321 00:36:02.600 00:36:09.769 Awaish Kumar: Do you know the concept of hooks in dbt? So there are some pre-hooks and post-hooks.

322 00:36:10.080 00:36:13.310 Awaish Kumar: Do you know anything about them?

323 00:36:14.720 00:36:25.030 jonathan g: I haven’t encountered hooks yet, unless you are referring to a webhook connecting to GitHub or Slack, but if that is not it, I think I haven’t.

324 00:36:25.190 00:36:39.050 Awaish Kumar: Okay, so I am talking about hooks in dbt. For example, obviously we connect with the database, we work with the database in dbt. So the hook, pre-hook or post-hook, is something like,

325 00:36:39.180 00:36:52.285 Awaish Kumar: I want to run a query, I want to execute a model. And then I want to say that, okay, while running this model, I created some

326 00:36:54.720 00:36:59.290 Awaish Kumar: like, I want to give permission to someone. Like, I want to create a table,

327 00:36:59.890 00:37:05.610 Awaish Kumar: and after that I want to make sure that Jonathan has access to that table.

328 00:37:05.790 00:37:07.360 Awaish Kumar: I want to write some grant,

329 00:37:07.770 00:37:10.419 Awaish Kumar: grant select on this table

330 00:37:10.550 00:37:21.059 Awaish Kumar: to user, something like that. So in the post-hooks basically you can do that. You run a query, which is a model in dbt;

331 00:37:21.270 00:37:29.650 Awaish Kumar: a model executes, and after that the post-hook executes, and basically in the post-hook we can run any SQL commands.
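
The run order described here (the model builds first, then each post-hook statement runs against the result) can be simulated in a few lines. This is a rough Python sketch using SQLite, not dbt’s actual API: `run_model`, the `{this}` placeholder, and the `access_log` table are all hypothetical. SQLite has no `GRANT` statement, so the hook records the access in a table instead; in a real warehouse the hook would be the grant statement itself.

```python
import sqlite3


def run_model(conn, name, select_sql, post_hooks=()):
    """Build table `name` from select_sql, then run each post-hook in order."""
    conn.execute(f"DROP TABLE IF EXISTS {name}")
    conn.execute(f"CREATE TABLE {name} AS {select_sql}")
    for statement in post_hooks:
        # "{this}" is a stand-in placeholder for the model's own table name
        conn.execute(statement.replace("{this}", name))
    conn.commit()


def demo():
    """Run one model with one post-hook and return what the hook recorded."""
    conn = sqlite3.connect(":memory:")
    # Stand-in for a permissions system, since SQLite cannot GRANT
    conn.execute("CREATE TABLE access_log (table_name TEXT, grantee TEXT)")
    run_model(
        conn,
        "daily_totals",
        "SELECT 1 AS day, 100 AS total",
        post_hooks=["INSERT INTO access_log VALUES ('{this}', 'jonathan')"],
    )
    return conn.execute("SELECT table_name, grantee FROM access_log").fetchall()


if __name__ == "__main__":
    print(demo())
```

The point of the sketch is only the ordering: the hook sees a table that already exists, which is why it is a natural place for permissions or other follow-up SQL.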

332 00:37:34.320 00:37:36.909 jonathan g: If something looks I heard about it.

333 00:37:37.580 00:37:43.326 Awaish Kumar: Okay, what what about like?

334 00:37:45.070 00:37:47.160 Awaish Kumar: since you have worked with BigQuery, right?

335 00:37:50.080 00:37:51.500 Awaish Kumar: Alright.

336 00:37:51.960 00:37:53.488 Awaish Kumar: So what is, like,

337 00:37:55.530 00:38:02.349 Awaish Kumar: hmm, what does the QUALIFY keyword do in BigQuery?

338 00:38:05.090 00:38:07.970 jonathan g: Qualify keyword. Did I hear it correctly?

339 00:38:08.290 00:38:16.560 Awaish Kumar: Yes, yes. So like SELECT, FROM, WHERE, GROUP BY, there is one more keyword, it’s called QUALIFY.

340 00:38:16.950 00:38:18.460 Awaish Kumar: Have you ever used it?

341 00:38:19.980 00:38:22.250 jonathan g: Haven’t had any usage of QUALIFY.

342 00:38:23.150 00:38:27.204 Awaish Kumar: Okay, and what about have you

343 00:38:30.324 00:38:33.645 Awaish Kumar: like, did you know anything about?

344 00:38:35.580 00:38:39.820 Awaish Kumar: you already mentioned, right, yeah, CTEs. What, like,

345 00:38:40.270 00:38:43.930 Awaish Kumar: what different window functions have you used in BigQuery?

346 00:38:46.930 00:38:54.319 jonathan g: Yeah, I’ve encountered ROW_NUMBER. I’d say that’s the most common when it comes to window functions. I think all SQL,

347 00:38:54.929 00:39:01.070 jonathan g: it doesn’t matter if that is dbt, BigQuery, or other database vendors, they’re using pro.

348 00:39:03.380 00:39:10.290 Awaish Kumar: Oh, and how would you? So do you know about the different slowly changing dimension types?

349 00:39:13.100 00:39:17.130 jonathan g: And for the slowly changing types there is SCD type one.

350 00:39:17.130 00:39:20.569 Awaish Kumar: There is a concept of slowly changing dimensions.

351 00:39:21.160 00:39:25.320 Awaish Kumar: And for slowly changing dimension. There are different types of it.

352 00:39:27.650 00:39:30.469 Awaish Kumar: And it’s called SCD type

353 00:39:30.810 00:39:36.449 Awaish Kumar: 0, 1, 2, like that. So can you elaborate more on this?

354 00:39:42.800 00:39:50.029 jonathan g: Well for the Scd type 0, I think, from what I understand is that it is just a

355 00:39:51.020 00:39:55.429 jonathan g: it’s simple extraction. So there’s no changes that needs to be included.

356 00:39:55.600 00:40:01.840 jonathan g: whereas for type one, there is a data that needs to be transformed.

357 00:40:02.030 00:40:06.019 jonathan g: and for the second one that applies also for

358 00:40:06.190 00:40:10.489 jonathan g: the not just the data, but also the column name as well.

359 00:40:11.200 00:40:13.530 jonathan g: When when you want to do something.

360 00:40:14.870 00:40:26.230 Awaish Kumar: And no, for the oops, slowly changing like for the slowly changing dimensions

361 00:40:28.790 00:40:34.930 Awaish Kumar: like type 2 like you mentioned that like what is.

362 00:40:36.200 00:40:41.160 Awaish Kumar: how, like, if I want to implement SCD type 2 for one of my tables,

363 00:40:41.930 00:40:44.259 Awaish Kumar: how can I implement that?

364 00:40:46.530 00:40:51.979 jonathan g: So from your source, you need to extract, then you have, like a transformation in the middle.

365 00:40:52.100 00:41:00.200 jonathan g: From there you need to change your. You do need to do mapping. So originally, your source is just a

366 00:41:01.360 00:41:15.349 jonathan g: let’s say Id. Then there’s the integer, then for your target. You want to do it as string. Then there’s also an underscore on the 1st name of Id, so that will be like underscore Id.

367 00:41:15.460 00:41:17.730 jonathan g: Then it will be changed to string.

368 00:41:20.080 00:41:22.708 Awaish Kumar: Okay, so yeah, like

369 00:41:24.040 00:41:26.720 Awaish Kumar: that was all the questions from my side.

370 00:41:27.040 00:41:33.010 Awaish Kumar: So now, if you want to ask anything about Brainforge, and what we do here,

371 00:41:34.220 00:41:35.600 Awaish Kumar: please go ahead.

372 00:41:36.160 00:41:38.030 jonathan g: Mind if I ask,

373 00:41:38.360 00:41:43.850 jonathan g: is Brainforge like a startup company, or is this already an established company?

374 00:41:46.060 00:41:51.150 Awaish Kumar: Brainforge is a bootstrapped startup company.

375 00:41:51.700 00:41:56.180 Awaish Kumar: We are only a team of like, as I mentioned, 10 to 15 people

376 00:41:56.630 00:41:59.660 Awaish Kumar: working on data and AI consultancy services.

377 00:41:59.950 00:42:06.029 Awaish Kumar: And we provide flexibility to work from anywhere in the world.

378 00:42:06.420 00:42:13.600 Awaish Kumar: Also with the flexibility to work with any kind of engagement like full time, part time

379 00:42:14.370 00:42:16.029 Awaish Kumar: with with my information.

380 00:42:19.680 00:42:24.249 jonathan g: So, and from that one participant

381 00:42:24.410 00:42:32.579 jonathan g: is this like a new position? Or is this like an additional role, additional headcount for this role?

382 00:42:32.580 00:42:33.350 Awaish Kumar: Sorry.

383 00:42:33.940 00:42:37.620 jonathan g: Is this a new open role, or additional.

384 00:42:39.805 00:42:44.244 Awaish Kumar: So this is, like, obviously a new role

385 00:42:45.260 00:42:54.490 Awaish Kumar: in the sense that, as I mentioned, we are a data and AI consultancy firm. We continue to get different clients

386 00:42:55.070 00:43:02.689 Awaish Kumar: to work on their data. And for that, our data and AI roles are always open.

387 00:43:03.550 00:43:07.849 Awaish Kumar: So we are always looking out for data people.

388 00:43:08.700 00:43:13.309 Awaish Kumar: Because we continue to get data projects.

389 00:43:13.720 00:43:18.920 Awaish Kumar: And yeah, so we are always looking for new people to come and try us.

390 00:43:19.940 00:43:23.360 jonathan g: Oh, okay, I have a question. Speaking of AI.

391 00:43:23.500 00:43:28.449 jonathan g: Is your, is the company open for AI whenever you are.

392 00:43:28.550 00:43:35.839 jonathan g: do? If yes, do you utilize the AI, or do you have, like an existing tool, right.

393 00:43:38.090 00:43:49.880 Awaish Kumar: So there are two things in terms of AI. Number one is, like, as an AI engineer, you develop the systems for the clients, for the internal teams.

394 00:43:50.751 00:43:54.188 Awaish Kumar: That’s the development part. The second part is

395 00:43:55.030 00:44:03.140 Awaish Kumar: the use of AI tools to improve our work performance,

396 00:44:03.190 00:44:14.590 Awaish Kumar: which is a different thing. So we have both. Our company has AI engineers who are building tools for our clients, who are making like these goals

397 00:44:14.640 00:44:36.720 Awaish Kumar: and different AI services for our clients and for our internal teams. But on the other side we are very actively using AI in any kind of development, data engineering, and data analytics. So if you want to build some models, you’re free to use any AI tool. We have, like, Cursor IDE to do that. We have

398 00:44:37.120 00:44:42.490 Awaish Kumar: a ChatGPT subscription. We are using Azure OpenAI. But

399 00:44:42.995 00:44:52.489 Awaish Kumar: if there is something which is really useful for the team members, we are okay, and open to get that for our team.

400 00:44:54.320 00:45:07.119 jonathan g: Thanks for sharing that one also any question. So for this phone is the team very flexible on their work schedule, like what I mean to say that they can work during their time zone.

401 00:45:07.510 00:45:17.480 jonathan g: For example, if you are in the US, you can work in your day-shift schedule; for the Philippines, you can work on your time schedule. Then,

402 00:45:17.750 00:45:22.220 jonathan g: if there’s a meeting, then you just attend afterwards. Call it a day.

403 00:45:22.390 00:45:24.490 jonathan g: That’s something on your team.

404 00:45:25.780 00:45:31.420 Awaish Kumar: Yeah, like, we are very flexible with the time zone. You can work at any time wherever you are.

405 00:45:33.320 00:45:36.970 jonathan g: Okay, that’s good to hear. Then, also.

406 00:45:38.820 00:45:41.320 jonathan g: yeah, what will be the next step after this interview?

407 00:45:42.045 00:45:47.739 Awaish Kumar: Yeah, like my team like, our operation team will connect with you after some time

408 00:45:48.590 00:45:51.219 Awaish Kumar: on this, like for the next steps in this week.

409 00:45:52.090 00:45:54.439 jonathan g: Okay. How long should I wait for your feedback.

410 00:45:55.380 00:45:58.869 Awaish Kumar: Yeah, I mean, in this week our our team is going to reach out.

411 00:45:59.730 00:46:00.240 Awaish Kumar: Okay.

412 00:46:00.240 00:46:02.510 jonathan g: Right? Yeah. Thanks. Aish for your time.