Meeting Title: Brainforge Final Interview Date: 2026-03-24 Meeting participants: Mouhamad, Samuel Roberts, Uttam Kumaran
WEBVTT
1 00:02:42.320 ⇒ 00:02:43.350 Samuel Roberts: Totally.
2 00:02:51.910 ⇒ 00:02:53.460 Mouhamad: Hi, Sam, how are you?
3 00:02:53.880 ⇒ 00:02:55.400 Samuel Roberts: Can you hear me alright?
4 00:02:55.730 ⇒ 00:02:57.689 Mouhamad: Yes, I can hear you now. Can you hear me?
5 00:02:57.690 ⇒ 00:02:59.900 Samuel Roberts: Yes, perfect, okay.
6 00:03:01.430 ⇒ 00:03:06.209 Samuel Roberts: Alright, so I’ll just give it a few minutes for Uttam to join.
7 00:03:06.750 ⇒ 00:03:07.290 Mouhamad: So.
8 00:03:09.890 ⇒ 00:03:11.880 Samuel Roberts: And then we can get started.
9 00:03:13.520 ⇒ 00:03:18.760 Mouhamad: I think I didn’t get the chance to look at the code, so I sent it. I didn’t see Kayla’s email.
10 00:03:18.760 ⇒ 00:03:25.549 Samuel Roberts: Yeah, sorry about that. I should have seen the email earlier, but I was saving the email to review today, and then by the time I looked at it, I saw
11 00:03:26.060 ⇒ 00:03:29.620 Samuel Roberts: It’s the… but we can… it’s fine, we can figure out the best way to…
12 00:03:29.620 ⇒ 00:03:30.290 Mouhamad: Yeah, that’s fine.
13 00:03:30.290 ⇒ 00:03:34.039 Samuel Roberts: Presentation probably is a pretty good place to start, and then I have…
14 00:03:34.040 ⇒ 00:03:43.060 Mouhamad: Yeah, I have a presentation, and I also did, I did… I’m gonna also do a simulation, so I did also a UI for you guys to see.
15 00:03:43.240 ⇒ 00:03:46.529 Mouhamad: So I would also… I would also like to do, like, a quick demo also, though.
16 00:03:47.330 ⇒ 00:03:47.970 Samuel Roberts: Great.
17 00:03:48.630 ⇒ 00:03:51.510 Samuel Roberts: Alright, yeah, let’s get… There we go, perfect.
18 00:03:56.670 ⇒ 00:03:57.440 Uttam Kumaran: Yeah, right?
19 00:03:58.660 ⇒ 00:03:59.380 Samuel Roberts: Yes.
20 00:04:00.460 ⇒ 00:04:01.950 Mouhamad: Hello, Uttam, we can hear you.
21 00:04:02.360 ⇒ 00:04:03.500 Uttam Kumaran: Hey, how’s everything?
22 00:04:04.710 ⇒ 00:04:07.650 Mouhamad: Hi, Uttam, how are you? Good, good. How’s everything with you?
23 00:04:07.990 ⇒ 00:04:11.079 Uttam Kumaran: Good, good. I’m excited to be here. Thank you for taking the time.
24 00:04:11.650 ⇒ 00:04:13.280 Mouhamad: Thank you, thank you for joining.
25 00:04:14.470 ⇒ 00:04:30.039 Uttam Kumaran: Cool, yeah, maybe we can get started. I don’t know if we… I think… I don’t know, Sam, if you’ve met… if you guys have already met through the process, but I can do… I can do an introduction, and then maybe we can kind of hand it back to you, Sam. So, it’s really great to meet you. You pronounce your name Muhammad?
26 00:04:30.680 ⇒ 00:04:31.790 Mouhamad: Yes, correct.
27 00:04:32.180 ⇒ 00:04:36.790 Uttam Kumaran: Yeah, it’s great to… great to meet you. I feel like I’ve heard a lot of great things through the process, so I’m…
28 00:04:36.790 ⇒ 00:04:37.590 Mouhamad: Oh, goodness.
29 00:04:37.970 ⇒ 00:04:38.940 Uttam Kumaran: excited to see…
30 00:04:38.940 ⇒ 00:04:40.020 Mouhamad: I’m glad.
31 00:04:40.020 ⇒ 00:04:51.479 Uttam Kumaran: your system today, and yeah, I run Brainforge, we’re a data and AI consultancy, so, doing a lot of… a lot of AI work, a lot of data work for clients, we’re growing really quickly.
32 00:04:51.790 ⇒ 00:05:07.050 Uttam Kumaran: trying to stay, up-to-date on everything in AI as well. And, yeah, just, like, super excited to continue to build a team. I think we’ll try to keep some time for questions, but yeah, feel free, Sam, you can kind of take a lead from here.
33 00:05:08.140 ⇒ 00:05:09.850 Samuel Roberts: Yeah, sure. So,
34 00:05:09.970 ⇒ 00:05:29.070 Samuel Roberts: we were just discussing, so he had sent a presentation, and then I went to go look at it, realized the code wasn’t there. He sent the code a little while ago, so I only had a little bit of time to start going through that, but I think if we could just start, maybe, with the, with what you have kind of prepared for us to show, and then we’ll… we can probably just
35 00:05:29.070 ⇒ 00:05:40.340 Samuel Roberts: interject with questions, throughout that. I have a few things I kind of want to make sure we hit, but I kind of want to see if we hit that before just jumping right in. So, if you could, kind of take it away.
36 00:05:40.600 ⇒ 00:05:42.840 Samuel Roberts: If you want to share screen.
37 00:05:43.110 ⇒ 00:05:46.410 Mouhamad: Sure, sure. Let me first share the presentation.
38 00:05:46.410 ⇒ 00:05:47.080 Samuel Roberts: Sure.
39 00:05:52.740 ⇒ 00:05:54.940 Mouhamad: feed orientation. Why?
40 00:05:55.110 ⇒ 00:05:55.900 Mouhamad: Okay.
41 00:05:56.320 ⇒ 00:05:57.960 Mouhamad: Entire screen done.
42 00:05:59.810 ⇒ 00:06:03.010 Mouhamad: Okay… Can you see my screen, though?
43 00:06:04.220 ⇒ 00:06:04.790 Samuel Roberts: Yes.
44 00:06:04.790 ⇒ 00:06:05.490 Uttam Kumaran: Yes.
45 00:06:07.400 ⇒ 00:06:08.170 Mouhamad: Slide.
46 00:06:09.090 ⇒ 00:06:21.499 Mouhamad: Okay, thank you, guys, for joining. First, this is basically the presentation about the project. So, the project is a product compliance pipeline. My way of seeing it is, like,
47 00:06:21.500 ⇒ 00:06:34.350 Mouhamad: deterministic where correctness matters, and AI only where the structure breaks down. It’s not just, like, put an LLM everywhere; it doesn’t work that way. No, it’s AI only where the structure actually breaks down.
48 00:06:34.560 ⇒ 00:06:41.379 Mouhamad: So… what’s the problem in the compliance system? Like, there are multiple problems. First is the…
49 00:06:41.550 ⇒ 00:07:00.950 Mouhamad: the format of the input. So it can come as PDFs, text files, JPEGs, or images, or anything. There are multiple input formats, not just one, and if you want to just convert everything to one format, it doesn’t work that way. Another thing is hidden ingredients, so…
50 00:07:01.650 ⇒ 00:07:15.029 Mouhamad: yes, you can find a well-structured document where you have, for example, an ingredients section, and inside it you can find the actual ingredients, etc. Well, it doesn’t work that way all the time. Sometimes it’s buried in paragraphs,
51 00:07:15.030 ⇒ 00:07:27.939 Mouhamad: written as chemical formulas, trade names, abbreviations, etc. OCR noise, that’s very, very common, especially if you’re working with Tesseract on numbers, or working also with…
52 00:07:28.040 ⇒ 00:07:36.740 Mouhamad: any OCR. Even LLMs can hallucinate, especially, like, OpenAI models can hallucinate on numbers, especially on Arabic or anything.
53 00:07:36.740 ⇒ 00:07:49.319 Mouhamad: So scanned documents force OCR mode, and there’s also the problem of document damage. AI hallucination risk: AI hallucinates, we all know this, and there’s nothing we can fully do. Yes, we can minimize hallucinations through techniques,
54 00:07:49.350 ⇒ 00:07:58.780 Mouhamad: but still, an LLM can invent ingredients that aren’t in the document. So blind trust in the LLM means false rejections, basically. Negation traps, so…
55 00:07:58.990 ⇒ 00:08:10.680 Mouhamad: some text inside the documents can be, like, “doesn’t contain X,” or something like that. This should not trigger a rejection, even if the name of the banned ingredient is there.
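The negation trap he describes can be sketched as a small guard; the patterns and function name below are illustrative stand-ins, not taken from the submitted code:

```python
import re

# Illustrative negation guard: a banned name inside a negating phrase
# ("does not contain X", "X-free", "without X") should not fire a rejection.
NEGATION_PATTERNS = [
    r"\bdoes\s+not\s+contain\b[^.]*?\b{name}\b",
    r"\b{name}[\s-]*free\b",
    r"\bwithout\b[^.]*?\b{name}\b",
]

def is_negated_mention(text: str, name: str) -> bool:
    """True if `name` appears inside a negated phrase of `text`."""
    low = text.lower()
    for pat in NEGATION_PATTERNS:
        if re.search(pat.format(name=re.escape(name.lower())), low):
            return True
    return False

is_negated_mention("This lotion does not contain parabens.", "parabens")   # True
is_negated_mention("Ingredients: water, parabens, glycerin.", "parabens")  # False
```

A real implementation would need per-mention scoping (one sentence may negate while another asserts), but the pattern-list shape keeps it configurable like the rest of the pipeline.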
56 00:08:10.760 ⇒ 00:08:19.249 Mouhamad: Also, there’s, like, the synonym explosion, so the same substance under dozens of names. So…
57 00:08:19.330 ⇒ 00:08:28.400 Mouhamad: The proposed architecture that I went in for here is just… I always try to make it simple, so it’s just a four-stage pipeline.
58 00:08:28.440 ⇒ 00:08:41.479 Mouhamad: So, each stage is independent, tested, auditable, etc. The first pillar is the file extraction, so whatever the format is, extracting the file to text. Then ingredient extraction, so this uses a deterministic parser plus AI.
59 00:08:41.600 ⇒ 00:08:44.039 Mouhamad: Then the third one is the forbidden matching.
60 00:08:44.330 ⇒ 00:08:48.270 Mouhamad: There are four layers in the forbidden matching, then at the end is just the decision.
61 00:08:48.570 ⇒ 00:08:55.780 Mouhamad: Basically, AI improves recall, but every decision is deterministic and rule-based, so the LLM is never the final judge.
62 00:08:57.240 ⇒ 00:09:07.069 Mouhamad: how the design evolved, it’s not just like, oh, I came up with a design quickly and then started coding. No, it didn’t work that way. So what I did is that the first
63 00:09:07.490 ⇒ 00:09:13.390 Mouhamad: thought about the design entirely, was just using deterministic parsing, so… fast and predictable.
64 00:09:13.460 ⇒ 00:09:32.600 Mouhamad: So, for example, if there’s, like, a section called ingredient, for example, the parts after it are gonna be ingredients, but it’s weak on messy OCR unstructured files, where regex doesn’t work, for example, and, like, it does miss, like, ingredients buried in paragraphs. So this is where I added another layer, which is the AI extraction.
65 00:09:32.810 ⇒ 00:09:36.340 Mouhamad: which improved recall significantly. AI catches…
66 00:09:36.400 ⇒ 00:09:39.500 Mouhamad: like, catches what regex cannot catch.
67 00:09:39.560 ⇒ 00:09:58.669 Mouhamad: But it also introduces a problem, which is the hallucination, as I spoke about before. So AI can invent ingredients that are not actually in the document. This is where I wanted something to ground it, so this is where I added, like, the evidence-grounded AI, where there was, like, also, like, snippet verification, so…
68 00:09:58.800 ⇒ 00:10:02.219 Mouhamad: it almost always returns, like, a snippet of evidence.
69 00:10:02.420 ⇒ 00:10:15.980 Mouhamad: If it doesn’t work that way, it’s not auditable properly, and we’re gonna look into it later on. Then I wanted to move from this section to, like, the forbidden matching, so I wanted to do, like, forbidden matching, but I didn’t want to
70 00:10:16.010 ⇒ 00:10:31.149 Mouhamad: just give it to the LLM, because the LLM, comparing, can also make mistakes on this. So this is where I added, like, a four-layer approach for the forbidden matching. We’re gonna talk about them in detail too. At the end, I was like,
71 00:10:31.510 ⇒ 00:10:44.499 Mouhamad: there’s also, like, the negation detection, where there’s, like, free-form phrasing or anything. So yeah, this is the final verdict that I added as well. This all shaped, like, the trajectory of the project and how it evolved.
72 00:10:44.810 ⇒ 00:10:47.180 Mouhamad: So, why choosing this design?
73 00:10:47.500 ⇒ 00:11:01.629 Mouhamad: simple, because stage one, for example, file extraction: a file arrives as text, PDFs, images, anything. So it’s as simple as this, just normalize into raw text, and the rest of the pipeline takes it from there. Deterministic parser, as I said:
74 00:11:01.810 ⇒ 00:11:07.179 Mouhamad: structured ingredient lists are better handled by rules; fast, cheap, predictable baseline.
75 00:11:07.180 ⇒ 00:11:22.470 Mouhamad: Also, one of the reasons why I chose, like, a deterministic parser and AI, I’ll talk about it later on, and not just the AI. AI extractor: real documents are noisy, OCR, damage, missing headers, etc. Trust-aware merge, so…
76 00:11:22.470 ⇒ 00:11:29.970 Mouhamad: AI, as I said, is not trusted blindly. If it’s trusted blindly, this is wrong, so this is where there is, like, a merge, or
77 00:11:30.160 ⇒ 00:11:45.889 Mouhamad: a smart, trust-aware merge. There’s a layered matcher, and the negation handling, and a deterministic layer at the end. The deterministic layer is just the final accept-reject; of course, it can go beyond this, but for the sake of this project, it was just, like, accept or reject.
78 00:11:46.270 ⇒ 00:11:50.219 Mouhamad: So, starting from the first pillar is the file extraction.
79 00:11:50.380 ⇒ 00:11:55.670 Mouhamad: what it does is that, basically, it can handle text or CSV,
80 00:11:55.820 ⇒ 00:12:03.499 Mouhamad: PDF, images, etc. The way that I read the data from the text and the CSV is simple.
81 00:12:03.740 ⇒ 00:12:13.140 Mouhamad: is just using UTF-8, and you read it fast, cleanest path, everything, etc. The PDF can go into two directions.
82 00:12:13.200 ⇒ 00:12:24.160 Mouhamad: either there’s, like, text immediately inside the PDF, and I can use PyMuPDF to get it, but the other one is if it’s, like,
83 00:12:24.500 ⇒ 00:12:38.789 Mouhamad: like scanned images inside the PDF. So, how do I detect both? All I do is just use PyMuPDF immediately. If it returns fewer than 20 characters,
84 00:12:38.930 ⇒ 00:12:46.910 Mouhamad: then there is no text layer inside of this, so probably this is, like, a scanned image. So fall back to OCR page by page and do it.
85 00:12:47.010 ⇒ 00:13:01.410 Mouhamad: Images go immediately to OCR via Tesseract, and if very little information got returned from it, it just emits a low-extraction warning. So everything also…
86 00:13:01.560 ⇒ 00:13:17.019 Mouhamad: there’s a lot of warnings. Like, for example, you can see in here there’s warnings: OCR fallback used, low extraction, unsupported file. This is all used in the second stage, for the deterministic part, but more for the AI part inside the second stage.
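The PDF routing rule just described might look like this minimal sketch; the 20-character threshold is the speaker’s heuristic, and the PyMuPDF and Tesseract calls are stubbed out so only the decision logic is shown:

```python
# Sketch of the text-vs-scanned routing rule. MIN_TEXT_CHARS = 20 is the
# heuristic from the talk; real extraction via PyMuPDF / Tesseract is
# assumed to happen outside this function.
MIN_TEXT_CHARS = 20

def route_pdf(embedded_text: str) -> tuple[str, list[str]]:
    """Given text pulled from the PDF's text layer, pick a route and warnings."""
    warnings = []
    if len(embedded_text.strip()) >= MIN_TEXT_CHARS:
        return "embedded_text", warnings      # PyMuPDF output is usable as-is
    warnings.append("ocr_fallback_used")      # likely a scanned image
    return "ocr_per_page", warnings           # would run Tesseract page by page

route_pdf("Ingredients: water, glycerin, citric acid")  # ('embedded_text', [])
route_pdf("  \n ")  # ('ocr_per_page', ['ocr_fallback_used'])
```

Returning warnings alongside the route matches the idea that downstream stages consume those flags rather than failing.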
87 00:13:17.270 ⇒ 00:13:21.890 Mouhamad: Where it goes now to the second stage for the deterministic part. So now, in here…
88 00:13:22.070 ⇒ 00:13:24.370 Mouhamad: So there are a couple of things.
89 00:13:24.810 ⇒ 00:13:35.059 Mouhamad: as I said, I want it to be here as code. The reason also why I wanted also this part to be inside in here is because, let’s say, the AI part was not…
90 00:13:35.350 ⇒ 00:13:49.070 Mouhamad: like, the .env was missing something, or was missing an OpenAI key, or missing anything for the AI, the code will not break. It will always work with the deterministic parser, even if the deterministic parser is not
91 00:13:49.180 ⇒ 00:14:01.190 Mouhamad: the best in here, or maybe it was messy, unsupported files and didn’t work, but at least the code would still work and doesn’t break fully. So, this part, how it works, there are…
92 00:14:01.540 ⇒ 00:14:06.819 Mouhamad: couple of things to it. So, first is the section detection. So, trying to detect
93 00:14:06.840 ⇒ 00:14:22.990 Mouhamad: where the section is. So, I’m using regex to search for things like ingredients, composition, active ingredients, etc. All patterns live in JSON files, so they are all configurable; no hard-coding, nothing inside of it.
94 00:14:23.000 ⇒ 00:14:28.699 Mouhamad: Then, another thing, after detecting the section, is detecting the stop pattern bounding. So.
95 00:14:28.940 ⇒ 00:14:43.010 Mouhamad: if, let’s say, a section is, like, “Ingredients:”, then you have the ingredients inside. I would also like to detect where it stops, so it’s not, like, grabbing the second paragraph or anything. So this is also using
96 00:14:43.210 ⇒ 00:14:56.419 Mouhamad: a stop-pattern bounding detection, where it handles, like, directions, blank lines, new headers, etc. Much more robust than a single regex. Candidate splitting: it also handles, like, comma separations, semicolons, bullets, so if I find, like, multiple
97 00:14:56.420 ⇒ 00:15:07.229 Mouhamad: candidates inside the ingredient list or anything, I just try to split them, because ingredients are gonna be, like, comma-delimited, or bulleted, or anything like that.
98 00:15:07.420 ⇒ 00:15:13.750 Mouhamad: Then at the end, a small filtering. Length bounds, so 2 to 100 characters, so…
99 00:15:13.940 ⇒ 00:15:26.310 Mouhamad: to detect if the ingredient is actually an ingredient, or there’s, like, some sort of error, or some sort of, like, non-ingredient inside the ingredient section, or anything like that. So, just basically noise detection in general.
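Stage 2A as described here can be sketched in a few lines; the inline patterns below are illustrative stand-ins for the configurable JSON pattern files he mentions:

```python
import re

# Minimal sketch of stage 2A: section detection, stop-pattern bounding,
# candidate splitting, and length-bound (2-100 char) filtering.
SECTION_PAT = re.compile(r"(ingredients?|composition|active ingredients?)\s*:", re.I)
STOP_PAT = re.compile(r"(directions|warnings|storage)\s*:|\n\s*\n", re.I)

def extract_candidates(text: str) -> list[str]:
    m = SECTION_PAT.search(text)
    if not m:
        return []                              # no section header found
    body = text[m.end():]
    stop = STOP_PAT.search(body)
    if stop:
        body = body[:stop.start()]             # bound the section
    parts = re.split(r"[,;\n\u2022]", body)    # commas, semicolons, bullets
    return [p.strip() for p in parts if 2 <= len(p.strip()) <= 100]

doc = "Daily Glow Clean\nIngredients: water, glycerin, citric acid\nDirections: apply daily."
extract_candidates(doc)  # ['water', 'glycerin', 'citric acid']
```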
100 00:15:26.370 ⇒ 00:15:36.950 Mouhamad: And there is also the confidence scoring in here. So, for the confidence scoring, this is an interesting one. So, the reason for this is the confidence scoring will direct
101 00:15:37.290 ⇒ 00:15:40.529 Mouhamad: the… the AI. So, for the confidence scoring.
102 00:15:40.990 ⇒ 00:15:51.919 Mouhamad: I have… if you can see in here, so there’s section header found. So if the section header was found, there’s an additional 0.35 to the score that gets added.
103 00:15:52.040 ⇒ 00:15:57.399 Mouhamad: If it’s actually clearly bounded, so I can find, like, the bound inside of it, there’s an additional 0.1.
104 00:15:57.880 ⇒ 00:16:01.320 Mouhamad: Product name found and ID found, this is 0.1.
105 00:16:01.450 ⇒ 00:16:09.369 Mouhamad: If I found more than 3 candidates, it’s additional 0.1, more than 6 candidates, 0.1, more than 10 candidates, 0.05,
106 00:16:09.490 ⇒ 00:16:27.669 Mouhamad: These are, like, all adding to the score. Now, there are also penalties, like, it’s not just, like, blindly just, like, adding stuff. So, if it’s more than 30%, like, sentences, so let’s say if more than 30% of what was found
107 00:16:27.800 ⇒ 00:16:32.150 Mouhamad: Was… look like sentences, so, based on the characters.
108 00:16:32.260 ⇒ 00:16:36.819 Mouhamad: There’s a… there’s a penalty of 0.25, so minus 0.25.
109 00:16:36.960 ⇒ 00:16:38.909 Mouhamad: If it’s more than…
110 00:16:38.990 ⇒ 00:16:49.820 Mouhamad: 20% of what was found was very short, so less than 4 characters, there is a penalty of 0.1. And full text fallback used, also there is a 0.2 also penalty.
111 00:16:49.820 ⇒ 00:17:03.029 Mouhamad: So, these are all calculations for the confidence score. So, also, it’s not like the AI gives the confidence; no, this is all just code. And the reason for doing the confidence score in here is because of… in here. So,
112 00:17:03.800 ⇒ 00:17:12.559 Mouhamad: Based on the confidence score, there are two directions. Either confirm and fill, or open extraction. What’s the difference between the two? So…
113 00:17:13.050 ⇒ 00:17:22.809 Mouhamad: If 2A was strong, I don’t want to, like, eliminate the work that 2A does, but I want to build on it, so…
114 00:17:23.339 ⇒ 00:17:28.409 Mouhamad: if the confidence is at least 0.45, and you might say, like, why 0.45?
115 00:17:28.610 ⇒ 00:17:47.499 Mouhamad: It’s just a number for now, but of course, in real development or environment or production, there will be, like, a lot of testing, a lot of cases, a lot of clients’ data or anything, so this number in here should be, like, realistically not 0.45, but it just depends on the data.
116 00:17:47.660 ⇒ 00:17:53.549 Mouhamad: And there are more than 2 candidates that were found, and a section was found, so if these three were…
117 00:17:54.120 ⇒ 00:17:55.779 Mouhamad: all there.
118 00:17:55.910 ⇒ 00:18:08.280 Mouhamad: the mode that it’s going to use is confirm and fill. So, the 2B AI extractor, for confirm and fill, has its own prompt, has its own direction.
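The additive scoring and the 0.45 routing just described can be sketched as follows; all weights and the threshold are the placeholder values from the talk, and the function names are illustrative:

```python
# Sketch of the additive confidence score and the 0.45 routing threshold.
# Weights and threshold are the speaker's placeholder values, not tuned.
def confidence(section_found, bounded, product_meta_found,
               n_candidates, frac_sentences, frac_short, fulltext_fallback):
    score = 0.0
    if section_found:         score += 0.35
    if bounded:               score += 0.10
    if product_meta_found:    score += 0.10   # product name and ID found
    if n_candidates > 3:      score += 0.10
    if n_candidates > 6:      score += 0.10
    if n_candidates > 10:     score += 0.05
    if frac_sentences > 0.30: score -= 0.25   # items that look like sentences
    if frac_short > 0.20:     score -= 0.10   # items under 4 characters
    if fulltext_fallback:     score -= 0.20
    return round(score, 2)

def ai_mode(score, n_candidates, section_found):
    if score >= 0.45 and n_candidates > 2 and section_found:
        return "confirm_and_fill"   # AI validates and extends 2A's candidates
    return "open_extraction"        # AI re-extracts from the raw document

confidence(True, True, True, 7, 0.0, 0.0, False)   # 0.75, like the clean walkthrough
confidence(True, True, False, 4, 0.5, 0.3, False)  # 0.2, like the noisy walkthrough
```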
119 00:18:08.460 ⇒ 00:18:18.449 Mouhamad: So the AI receives the document, plus deterministic candidate, and the task for this is validate what was found, don’t throw it away, just validate it, add anything missed.
120 00:18:19.080 ⇒ 00:18:27.969 Mouhamad: And the reason for this is it’s cheaper, it’s more focused, because we’re limiting the AI’s scope, or the AI’s work.
121 00:18:28.090 ⇒ 00:18:29.270 Mouhamad: Not just, like.
122 00:18:29.800 ⇒ 00:18:38.690 Mouhamad: take the text and do everything. No. There was already some evidence, just see if the evidence are correct, because also, what I’m doing from…
123 00:18:39.060 ⇒ 00:18:40.649 Mouhamad: 2A in here.
124 00:18:40.900 ⇒ 00:18:42.950 Mouhamad: It’s also getting, like…
125 00:18:43.140 ⇒ 00:18:56.959 Mouhamad: snippets, everything. So every candidate will take the 80 characters before, the actual ingredient, and then the 20 characters after, so there is, like, evidence with it. This is all given to the LLM in here for this one,
126 00:18:57.140 ⇒ 00:19:03.490 Mouhamad: To… to say, like, validate what was found, and add anything that was missed, because also we’re not gonna be like, oh.
127 00:19:03.490 ⇒ 00:19:07.870 Uttam Kumaran: It’s almost like judgment. Like, you’re giving it all the evidence, and you’re having to make the judgment. Okay.
128 00:19:08.120 ⇒ 00:19:24.099 Mouhamad: Exactly, exactly. I’m giving it as a judgment. I don’t want to throw the work. Or, there’s another route, which is the open extraction. So, this one, if the confidence was low, if the candidates were not… were not found, like, entirely, if the section was
129 00:19:24.290 ⇒ 00:19:26.240 Mouhamad: Was not found or anything.
130 00:19:26.320 ⇒ 00:19:39.610 Mouhamad: And here, AI receives only the document text, so I’m giving the document text to the AI, and with a prompt, with a different prompt than this one as well. And it’s saying, extract all ingredients from scratch.
131 00:19:39.610 ⇒ 00:19:47.159 Mouhamad: It is more expensive, but it’s necessary, especially on messy OCRs, on… on text that was not properly found or anything.
132 00:19:47.460 ⇒ 00:19:48.490 Mouhamad: And…
133 00:19:48.500 ⇒ 00:20:02.600 Mouhamad: I’m sorry. And the trust anchor is basically: every AI ingredient is expected to return a source snippet, so a verbatim phrase copied from the document. That snippet is then verified against the extracted raw text, so…
134 00:20:02.600 ⇒ 00:20:21.330 Mouhamad: what happens after this? Simple. So, new AI-only item plus verified snippet, it will enter a matching as AI-only verified. If a new AI-only item plus unverified snippet, quarantined. Because in here, it might just say, like, oh, the LLM has hallucinated or anything.
135 00:20:21.540 ⇒ 00:20:32.129 Mouhamad: And if there is overlap with a deterministic item, then it always becomes both, both of them. And overlap grounding is traced separately, via, like, the AI snippet verified flag, etc. So…
136 00:20:32.520 ⇒ 00:20:41.969 Mouhamad: again, as I said, the reason also for doing this: if the AI call fails, the pipeline continues with the deterministic results, so the code never fails, it will still work its way.
137 00:20:42.380 ⇒ 00:20:43.440 Mouhamad: After
138 00:20:43.730 ⇒ 00:20:52.080 Mouhamad: getting this, now we have 2A and we have 2B together. There’s a trust-aware merge. So what happens in here? So…
139 00:20:52.380 ⇒ 00:20:59.069 Mouhamad: Let’s say 2A found some ingredient, and 2B also found the same ingredient, also verified it with some snippets, etc.
140 00:20:59.180 ⇒ 00:21:03.910 Mouhamad: found it, then both of them found it, found by the deterministic parser and AI.
141 00:21:04.030 ⇒ 00:21:09.940 Mouhamad: Enters matching? Yes. Yes, overlaps, provenance, so AI grounding traced, so both of them have…
142 00:21:10.730 ⇒ 00:21:21.359 Mouhamad: shaken hands, everything. So both of them have agreed on something. Amazing. Now, what if deterministic only? So, deterministic… so 2A found something,
143 00:21:21.600 ⇒ 00:21:23.379 Mouhamad: but AI didn’t find it.
144 00:21:23.720 ⇒ 00:21:26.360 Mouhamad: Does it… does it enter matching?
145 00:21:26.620 ⇒ 00:21:30.200 Mouhamad: Yes, baseline parser case. So…
146 00:21:30.580 ⇒ 00:21:46.790 Mouhamad: It can enter matching, and this can also, like, be improved even more in this stage, in this area, like, what’s the next step that can be done in here. But it can enter, but it also can be wrong, because, yes, code found it, but it might be wrong or anything.
147 00:21:47.190 ⇒ 00:22:03.409 Mouhamad: AI only, verified: so, let’s say found only by AI, just like in this one here, in confirm and fill, so it added more things. Found only by AI, but there is the snippet with it. Yes, it enters matching, grounded, because there is a snippet, there is evidence that this is correct.
148 00:22:03.650 ⇒ 00:22:07.229 Mouhamad: Let’s say, and found only by
149 00:22:07.400 ⇒ 00:22:18.850 Mouhamad: AI, but the snippet was not verified, this is where we quarantine. So, if the AI found it, and there was no evidence for it, no, I’m rejecting it. I don’t trust it, and I don’t want to trust it.
150 00:22:19.110 ⇒ 00:22:19.990 Mouhamad: So…
151 00:22:20.310 ⇒ 00:22:36.619 Mouhamad: As you can see in here, for example, AI returns the ingredient caffeine, whose snippet contains “caffeine as stimulant”. Both snippet and raw text are normalized: lowercased, whitespace collapsed, etc. Is the normalized snippet a substring of the normalized raw text? Yes: AI-only verified with source snippet.
152 00:22:36.910 ⇒ 00:22:40.119 Mouhamad: Then enters matching. If no, then quarantine.
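The evidence grounding and trust-aware merge just walked through can be sketched together; the function and status names below are illustrative, following the provenance table from the talk:

```python
import re

# Sketch of evidence grounding: build a snippet of 80 chars before / 20
# after each candidate, then verify AI snippets by normalized-substring
# containment; statuses follow the provenance table described in the talk.
def evidence_snippet(raw_text, candidate, before=80, after=20):
    i = raw_text.lower().find(candidate.lower())
    if i == -1:
        return None
    return raw_text[max(0, i - before): i + len(candidate) + after]

def _norm(s):
    return re.sub(r"\s+", " ", s.lower()).strip()

def merge_status(found_by_parser, found_by_ai, snippet, raw_text):
    verified = snippet is not None and _norm(snippet) in _norm(raw_text)
    if found_by_parser and found_by_ai:
        return "both"                 # agreed by parser and AI: trusted
    if found_by_parser:
        return "deterministic_only"   # baseline parser case, still enters matching
    if verified:
        return "ai_only_verified"     # grounded in the document text
    return "quarantined"              # AI claim with no evidence

raw = "Contains caffeine as stimulant, plus water."
merge_status(False, True, evidence_snippet(raw, "caffeine"), raw)  # 'ai_only_verified'
merge_status(False, True, "taurine blend", raw)                    # 'quarantined'
```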
153 00:22:40.830 ⇒ 00:23:00.089 Mouhamad: In here, there is just an example where I explain how it happens. For example, you have a document where you have Daily Glow Clean, this is the product name, and this is the product ID, with a description, with ingredients, with directions, storage, etc. 2A would run and would say, okay, this is metadata extraction, I’m gonna extract these two, because there is, like, a product name, product ID. Amazing.
154 00:23:00.900 ⇒ 00:23:01.800 Mouhamad: the…
155 00:23:03.790 ⇒ 00:23:16.770 Mouhamad: Okay, I’m gonna… the code’s still continuing. It will find ingredients, so it’ll say, like, found header, “Ingredients”, amazing, this is the start of the section. Section bounding: these are all the information inside of it, so it cuts over here because there’s…
156 00:23:16.780 ⇒ 00:23:25.599 Mouhamad: like, a white space. So, cut the section before “Directions”, and then you can get the candidates, splitting into 6 candidates by commas, and you can get them.
157 00:23:25.660 ⇒ 00:23:36.319 Mouhamad: Then, per candidate evidence snippets, so you take per candidate, there’s a snippet of, as I said, 80 characters before, with the candidate, with 20 characters after. Then, what happens is.
158 00:23:36.400 ⇒ 00:23:37.459 Mouhamad: As I said.
159 00:23:37.650 ⇒ 00:23:47.709 Mouhamad: Section found, yes. Section bounded, yes. Product found, yes. More than 3, yes. More than 6, yes. Penalties, no penalties. So the confidence is 0.75 after just adding them.
160 00:23:47.950 ⇒ 00:23:52.800 Mouhamad: AI section mode is confirm and fill. So, the AI then
161 00:23:52.960 ⇒ 00:24:05.340 Mouhamad: would receive, like, the document text plus the 6 deterministic candidates. The task is: confirm real items, add anything missing, return a source snippet for each. So, this is what the AI does in here. Then the merge logic:
162 00:24:05.540 ⇒ 00:24:20.699 Mouhamad: if the AI found these 6, and confirmed everything is correct, etc., and didn’t find anything more, you can see in here the origin of these candidates is both, and snippet verified with it, yes, so the status is trusted.
163 00:24:20.920 ⇒ 00:24:29.330 Mouhamad: Now, what happens if the confidence was not high? So, something like this: “This product has been carefully formulated using the finest ingredients available on the market.”
164 00:24:29.710 ⇒ 00:24:34.440 Mouhamad: And it kind of has these ones, “with added skin softening agents”, so…
165 00:24:34.620 ⇒ 00:24:48.999 Mouhamad: this is noisy, because it can kind of say, okay, I found the ingredients in here, this is, like, the section bounding in here, okay, these are the ones. So it takes, like, “this product has been carefully,” etc., etc., as the…
166 00:24:49.130 ⇒ 00:25:01.619 Mouhamad: It can… it maybe can detect just this one as correct. Now, is the confidence low? Yes. Because even though the section was found, by “ingredients”, more than 3 candidates, yes,
167 00:25:01.970 ⇒ 00:25:03.759 Mouhamad: more than 6, maybe yes.
168 00:25:04.160 ⇒ 00:25:07.250 Mouhamad: But this is where the penalty fires now.
169 00:25:07.490 ⇒ 00:25:20.559 Mouhamad: So, you will have, for example, the long-item penalty fires, so minus 0.25, and the short-item penalty fires, also minus 0.1. So, you can see in here the confidence is 0.2.
170 00:25:20.630 ⇒ 00:25:32.389 Mouhamad: So, even though the section was found, one of the conditions, the confidence is low, so it goes to the open extraction, where the AI receives the document in its entirety.
171 00:25:32.480 ⇒ 00:25:35.740 Mouhamad: And what it does in here, so…
172 00:25:35.880 ⇒ 00:25:39.449 Mouhamad: The merge still starts with all the deterministic candidates, noise included.
173 00:25:39.520 ⇒ 00:25:53.619 Mouhamad: And the AI takes it, it takes everything, but you can see in here, for example, in here it was just deterministic only; even though these are not correct, the AI didn’t verify them, it said, like, it’s not correct. But only this one was verified by both.
174 00:25:53.630 ⇒ 00:26:03.419 Mouhamad: So, in here, why does it still work? Because step 2 does not filter its noisy candidates; the merge only upgrades, adds verified items. Step 3 is what filters, naturally,
175 00:26:03.440 ⇒ 00:26:13.339 Mouhamad: Where it filters all of these, and separation of concerns. So, step two builds the candidate pool with provenance, step three decides what’s forbidden, and step four decides accept or reject.
176 00:26:13.810 ⇒ 00:26:16.429 Mouhamad: And… in here.
177 00:26:16.580 ⇒ 00:26:25.440 Mouhamad: how trust is built, basically. So, deterministic parsing, as we said: 6 candidates were found, with confidence high, etc.
178 00:26:25.530 ⇒ 00:26:39.780 Mouhamad: AI extractor: it confirms all 6, finds one extra, so it found this one as extra, but verified it with a snippet. Then the merge result: all of these, and this one is AI-only verified with a snippet, so it does verify.
179 00:26:39.930 ⇒ 00:26:40.850 Mouhamad: So…
180 00:26:41.020 ⇒ 00:26:47.990 Mouhamad: Origin “both” means both extractors found it; AI snippet verified “true” means the AI was grounded in the source text.
181 00:26:48.930 ⇒ 00:26:55.760 Mouhamad: Now, in all of these, it was just basically trying to find the ingredients inside a…
182 00:26:56.040 ⇒ 00:27:02.750 Mouhamad: a properly, like, shaped document, or, like, a noisy, or anything. Now.
183 00:27:02.940 ⇒ 00:27:12.450 Mouhamad: The next step is matching. So, we have a list of, like, forbidden items, and we have the ones that we found. So…
184 00:27:12.630 ⇒ 00:27:28.480 Mouhamad: First, the matching part. What does it mean? So, as I said, I didn’t want to just give the LLM both of these and say, like, okay, match them together, no, because it can also hallucinate or anything. But the LLM is used in here as the semantic fallback at the end.
185 00:27:28.600 ⇒ 00:27:32.649 Mouhamad: So, how it works is, first is the exact match.
186 00:27:32.980 ⇒ 00:27:49.650 Mouhamad: So, I have a file in the code, which is the aliases, it’s called ingredientalias.csv. So, what it does is, let’s say there is benzene, which has, like, an alias name of, like, C6H6, for example, or anything.
187 00:27:49.700 ⇒ 00:27:55.449 Mouhamad: So, it builds a map of the actual name with the alias name and the canonical name.
188 00:27:55.790 ⇒ 00:28:00.620 Mouhamad: this… this is used in all of this. How? So the exact match, if…
189 00:28:00.970 ⇒ 00:28:05.240 Mouhamad: If there is, like, a propyl disk, this one.
190 00:28:05.460 ⇒ 00:28:13.250 Mouhamad: If it was found just, like, exactly how it is, there’s a match. There’s an exact match. Okay, so these ones do not run.
191 00:28:13.550 ⇒ 00:28:17.619 Mouhamad: I just found it, it said, amazing, let’s continue.
192 00:28:17.970 ⇒ 00:28:19.850 Mouhamad: If this…
193 00:28:20.000 ⇒ 00:28:29.220 Mouhamad: isn’t found, then it goes to the second one, which is the alias and canonical. So, both sides resolve via ingredientalias.csv to a shared canonical name.
194 00:28:29.370 ⇒ 00:28:42.569 Mouhamad: So, if we have, for example, C6H6, whose canonical is benzene, then the forbidden benzene matches. So, even if we don’t have the actual ingredient name, but we have, like, its alias inside
195 00:28:42.660 ⇒ 00:28:49.819 Mouhamad: the paragraph or anything, they can still match together, because there’s a mapping between the two, like, via the aliases.
196 00:28:50.160 ⇒ 00:28:51.400 Mouhamad: If this is…
197 00:28:51.400 ⇒ 00:28:52.389 Samuel Roberts: mapping coming from?
198 00:28:52.960 ⇒ 00:28:56.729 Mouhamad: In here. So there’s a… there’s a file called ingredientalias.csv.
199 00:28:56.890 ⇒ 00:28:58.609 Samuel Roberts: So that’s a predetermined list, is what you’re saying?
200 00:28:58.610 ⇒ 00:28:59.720 Mouhamad: Yes, yes.
201 00:28:59.720 ⇒ 00:29:02.240 Samuel Roberts: Okay, okay, cool. I just wasn’t sure if I missed that. Okay, thank you.
202 00:29:02.470 ⇒ 00:29:03.430 Samuel Roberts: Yeah, cool.
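[Editor’s note: the first two layers described above — exact match, then resolving both sides through the alias CSV to a shared canonical name — could be sketched roughly as below. This is a minimal illustration, not the candidate’s actual code; the `alias,canonical` column names are an assumption about the ingredientalias.csv schema.]

```python
import csv
import io

def build_alias_map(csv_text):
    """Build a lookup from every alias and canonical name to the canonical name.

    Assumes two columns, alias and canonical; the real ingredientalias.csv
    schema may differ.
    """
    alias_map = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        canonical = row["canonical"].strip().lower()
        alias_map[canonical] = canonical                 # canonical maps to itself
        alias_map[row["alias"].strip().lower()] = canonical
    return alias_map

def match_exact_or_alias(ingredient, forbidden, alias_map):
    """Layer A: exact match. Layer B: both sides resolve to a shared canonical."""
    ing, forb = ingredient.strip().lower(), forbidden.strip().lower()
    if ing == forb:                                      # Layer A: exact
        return "exact"
    if alias_map.get(ing) and alias_map.get(ing) == alias_map.get(forb):
        return "alias"                                   # Layer B: shared canonical
    return None
```

So a document that says "C6H6" still matches the forbidden entry "benzene", because both resolve to the same canonical name.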
203 00:29:03.680 ⇒ 00:29:23.649 Mouhamad: So, as I said, if this is found, amazing, don’t run this. If this is not found but this is found, don’t run this. Now, if these two are not found, there is no match yet, and it goes to the third one. This third one is just for the formula. So, earlier on, I said that OCR is noisy, especially on numbers, etc.
204 00:29:24.090 ⇒ 00:29:24.920 Mouhamad: So…
205 00:29:25.140 ⇒ 00:29:35.069 Mouhamad: so formulas like these, we can, for example, OCR can make a mistake and doesn’t include, for example, the O in here, so you can see that it’s just 2.
206 00:29:35.390 ⇒ 00:29:43.899 Mouhamad: If you use exact match, or alias and canonical, it will not find it, because exact match is exact match, and the alias and canonical layer doesn’t work that way either.
207 00:29:44.220 ⇒ 00:29:58.529 Mouhamad: So now we need something for the formulas where OCR can make errors or something. So, I’m using RapidFuzz at a 90 threshold. Again, 90 can be different based on the data, based on everything, and this is only on formulas.
208 00:29:58.720 ⇒ 00:30:08.169 Mouhamad: So, formula-shaped tokens, so letters and digits, no spaces, etc. It does fuzzy matching, and if the score is above 90, that means it’s a match.
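[Editor’s note: the formula layer uses RapidFuzz’s `fuzz.ratio` at a tunable 90 threshold. As a dependency-free stand-in for illustration, `difflib.SequenceMatcher` gives a comparable 0–100 similarity score; the shape below is an assumption, not the candidate’s code.]

```python
import re
from difflib import SequenceMatcher

def is_formula_shaped(token):
    """Layer C only applies to formula-shaped tokens: letters and digits, no spaces."""
    return bool(re.fullmatch(r"[A-Za-z0-9]+", token))

def fuzzy_formula_match(candidate, forbidden, threshold=90):
    """Fuzzy-match two chemical formulas, tolerating small OCR slips.

    Stand-in for rapidfuzz.fuzz.ratio; the threshold is tunable per dataset.
    """
    if not (is_formula_shaped(candidate) and is_formula_shaped(forbidden)):
        return False
    score = SequenceMatcher(None, candidate.upper(), forbidden.upper()).ratio() * 100
    return score >= threshold
```

Note that a one-character OCR slip on a short formula can score just below 90 (an O read as a 0 in "C8H10N4O2" scores about 89), which is exactly why the talk stresses that the threshold must be tuned to the data.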
209 00:30:08.380 ⇒ 00:30:17.439 Mouhamad: You can expand it even more and more. Then I went on to do, like, the last one. If these three don’t match, this is the last one that runs.
210 00:30:17.530 ⇒ 00:30:28.949 Mouhamad: This is just a semantic fallback, so embedding, top K cosine, LLM pairwise, verify at more than 85%, 0.85 confidence, so…
211 00:30:29.090 ⇒ 00:30:43.010 Mouhamad: if none of these matched, I’m embedding, like, the forbidden ingredient with the actual ingredient, and trying to see if we can retrieve, like, the top 3, and the LLM will verify it.
212 00:30:44.360 ⇒ 00:30:58.290 Mouhamad: These, of course… for example, in here, I am defining the aliases myself, so this also depends heavily on the data. Like, if the data is known, yes, we can do this. If the data is even a little bit big, we can also, like.
213 00:30:58.390 ⇒ 00:31:11.099 Mouhamad: match the aliases with the canonical name, etc. If it’s, like, open-ended and, like, extremely large, then yes, we cannot do that level, but it depends on the data. It depends on the client, it depends on everything.
214 00:31:11.610 ⇒ 00:31:18.480 Mouhamad: This is the forbidden matching, which is the four-layer one. Then, at the end, so…
215 00:31:18.600 ⇒ 00:31:22.279 Mouhamad: In here, this is just explaining…
216 00:31:22.280 ⇒ 00:31:40.529 Mouhamad: the embedding trick. So, embedding this one alone gives no meaningful vector, so it won’t retrieve against caffeine, even though this is caffeine; it will not retrieve against caffeine. So, a better embedding text combines all known forms of caffeine, all of this; now the formula embeds with its readable name, and the vector has
217 00:31:40.680 ⇒ 00:31:42.879 Mouhamad: Real semantic meaning.
218 00:31:42.990 ⇒ 00:31:57.889 Mouhamad: So, in here, the two-stage verification is just: stage one, embedding retrieval, so embed the candidate, cosine similarity, etc., top three above the cutoff. I didn’t want to go into, like, much depth into this, because this is basically, like, the fallback.
219 00:31:58.030 ⇒ 00:32:15.599 Mouhamad: Like, most of the cases should hit in A to C, and the last one is just, like, a fallback, if actually needed. Then LLM pairwise, so: do these two names refer to the same ingredient, the prohibited one, etc.? Only accept if equivalent equals true and confidence is above 0.85.
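[Editor’s note: the two-stage fallback described here — embedding retrieval of the top 3 by cosine similarity, then an LLM pairwise check gated at 0.85 confidence — can be sketched as below. The embedding and LLM calls are stubbed out as plain callables, since the real system uses OpenAI APIs; names here are illustrative.]

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def semantic_fallback(candidate_vec, forbidden_vecs, verify_pair, top_k=3, min_conf=0.85):
    """Stage 1: retrieve the top-k forbidden entries by cosine similarity.
    Stage 2: LLM pairwise verification; accept only if equivalent is true
    and confidence is at least min_conf.

    forbidden_vecs: {name: vector}, embedded once at startup and cached.
    verify_pair: callable(name) -> (equivalent: bool, confidence: float),
    a stand-in for the real LLM call.
    """
    ranked = sorted(forbidden_vecs,
                    key=lambda n: cosine(candidate_vec, forbidden_vecs[n]),
                    reverse=True)
    for name in ranked[:top_k]:
        equivalent, confidence = verify_pair(name)
        if equivalent and confidence >= min_conf:
            return name          # verified semantic match
    return None                  # unverified candidates go to quarantine/audit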
220 00:32:15.760 ⇒ 00:32:16.580 Mouhamad: And…
221 00:32:16.790 ⇒ 00:32:27.159 Mouhamad: Again, why not just use AI for this? So, layer D only runs after A, B, and C fail. Most ingredients match at layer A, the exact name.
222 00:32:27.160 ⇒ 00:32:39.860 Mouhamad: Semantic matching is expensive. It is expensive, and it is also prone to error, because it would require, like, an embedding and an LLM call per candidate, and you might have, like, a lot of candidates that it would
223 00:32:39.910 ⇒ 00:32:46.119 Mouhamad: run on a lot. And by making it the fallback, we minimize, like, the cost and maximize precision.
224 00:32:46.390 ⇒ 00:33:02.249 Mouhamad: Of course, if the client has all the money in the world, sure, run an LLM. But also, we need to optimize; it’s not just, like, blindly put an LLM everywhere. So the forbidden list is embedded once at startup and cached; only a new candidate triggers API calls.
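[Editor’s note: the cost argument above boils down to a short-circuiting cascade — each layer only runs if every cheaper layer before it failed, so most ingredients never reach the LLM. A hypothetical dispatcher shape, purely for illustration:]

```python
def match_forbidden(ingredient, forbidden, layers):
    """Run the matching layers in order, cheapest first (exact, alias,
    formula, semantic). Each layer only runs if all previous layers
    failed, so most ingredients never reach the semantic (LLM) fallback.

    layers: list of (name, callable(ingredient, forbidden) -> bool).
    Returns the name of the layer that matched, for provenance tracking.
    """
    for name, layer in layers:
        if layer(ingredient, forbidden):
            return name          # record which layer produced the match
    return None
```

The returned layer name doubles as the "origin method" the talk mentions under full provenance tracking.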
225 00:33:03.440 ⇒ 00:33:10.049 Mouhamad: At this point, we are pretty good at matching what we have, but there’s one more thing.
226 00:33:10.750 ⇒ 00:33:16.579 Mouhamad: What if there was a phrase called free from… this, before the ingredient?
227 00:33:16.850 ⇒ 00:33:18.469 Mouhamad: So…
228 00:33:18.860 ⇒ 00:33:26.610 Mouhamad: This should not trigger the rejection, even though the text does mention this ingredient, because it says free from.
229 00:33:27.060 ⇒ 00:33:29.530 Mouhamad: So… How does it work?
230 00:33:29.810 ⇒ 00:33:45.980 Mouhamad: So, in here there were two directions. Either the LLM would also handle it, or I can just… what I did in here. So, I created a negation patterns JSON. The reason for this is that throughout my, like, work with LLMs and everything.
231 00:33:46.070 ⇒ 00:33:55.230 Mouhamad: I did find that AI can sometimes miss these very small things, so free from can easily be missed by the LLM.
232 00:33:55.310 ⇒ 00:34:07.709 Mouhamad: sometimes by accident, because, like, for example, if the prompt is too big or anything, the LLM can hallucinate, depending on its temperature. There are a lot of factors that can go wrong. So, what I did is that
233 00:34:07.880 ⇒ 00:34:20.880 Mouhamad: I just created, like, a negation pattern list, which is, like, from a negation config JSON. So: free from, does not contain, without, free of, no added, not present, etc.
234 00:34:21.050 ⇒ 00:34:25.219 Mouhamad: How it works: let’s say a forbidden match is found.
235 00:34:25.449 ⇒ 00:34:31.620 Mouhamad: This is the layer that runs after the forbidden matching, so after an ingredient is found and a forbidden match is found.
236 00:34:31.790 ⇒ 00:34:36.679 Mouhamad: I’m checking the 80-character window before the ingredient in the evidence snippet.
237 00:34:36.780 ⇒ 00:34:46.289 Mouhamad: So, this is the ingredient, and I’m taking everything that comes before it. If any negation phrase is found in that window, flag it as a negation context. So, if anything was found from these.
238 00:34:46.460 ⇒ 00:34:53.470 Mouhamad: That means it’s in negation. Negated matches are kept for audits, but do not drive rejection. So, if…
239 00:34:53.770 ⇒ 00:35:03.670 Mouhamad: if I found, like, a free from this, this is not… this is not gonna trigger, like, the rejection or anything. So, as an example, so this product is free from this, and…
240 00:35:03.670 ⇒ 00:35:13.809 Mouhamad: So, result match found, it did match, like, there is, like, something prohibited in here, but negation context is also true. The negation phrase is free from, so it does not reject.
241 00:35:14.150 ⇒ 00:35:19.439 Mouhamad: This is added later on as, like, a safety layer for the negation and stuff.
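[Editor’s note: the negation check described here — look at the 80-character window before the matched ingredient for any configured negation phrase — is simple enough to sketch directly. The phrase list below echoes the ones named in the talk; the real set lives in the negation config JSON.]

```python
# Phrases as named in the talk; the real list comes from a config JSON.
NEGATION_PHRASES = ["free from", "does not contain", "without",
                    "free of", "no added", "not present"]

def in_negation_context(text, ingredient, window=80, phrases=NEGATION_PHRASES):
    """Check the window of characters immediately before the matched
    ingredient for a negation phrase. Negated matches are kept for
    audit but do not drive rejection."""
    lower = text.lower()
    idx = lower.find(ingredient.lower())
    if idx == -1:
        return False
    before = lower[max(0, idx - window):idx]
    return any(phrase in before for phrase in phrases)
```

So "This product is free from benzene" yields a match with negation_context true, and the decision layer does not reject on it.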
242 00:35:20.650 ⇒ 00:35:24.840 Mouhamad: Finally, after we have all of these, this is the decision layer.
243 00:35:25.110 ⇒ 00:35:33.409 Mouhamad: In here, this is open-ended. It can go into, like, some sort of action. For example, it can send an email.
244 00:35:33.640 ⇒ 00:35:48.360 Mouhamad: the agent can do things in here. There’s a lot of possibilities. What I just wanted to do in here is just simply accepted or rejected. Intentionally simple, intentionally deterministic. If it’s rejected:
245 00:35:48.950 ⇒ 00:35:58.510 Mouhamad: no exceptions, no LLM involved in the decision. Anything accepted has zero real forbidden matches; negated ones become warnings, etc. And there’s, like, a human-readable reason.
246 00:35:58.570 ⇒ 00:36:17.660 Mouhamad: You’ll see it now in the demo when I spin it up. And then, like, contains, for example, contains this, contains that. Warning aggregation: all warnings from the extraction pipeline and the matcher are all flattened, deduplicated, and included in the response. So, anything that is, like, flagged or a warning is also included.
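[Editor’s note: the decision layer as described is deterministic — reject on any non-negated forbidden match, no LLM involved, with a human-readable reason and all pipeline warnings flattened and deduplicated. A minimal sketch under those assumptions; the dict shapes and keys are illustrative, not the candidate’s actual schema:]

```python
def decide(matches, warning_lists):
    """Deterministic decision layer: reject on any non-negated forbidden
    match, no LLM involved. Warnings from all pipeline stages are
    flattened, deduplicated, and included in the response."""
    real = [m for m in matches if not m.get("negation_context")]
    warnings = []
    for lst in warning_lists:
        for w in lst:
            if w not in warnings:        # dedupe, preserving order
                warnings.append(w)
    if real:
        names = ", ".join(m["ingredient"] for m in real)
        return {"status": "rejected",
                "reason": f"Contains: {names}",
                "warnings": warnings}
    return {"status": "accepted",
            "reason": "No forbidden ingredients detected",
            "warnings": warnings}
```

Keeping this layer free of any model call is what makes the final accept/reject reproducible and auditable.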
247 00:36:18.620 ⇒ 00:36:30.749 Mouhamad: This is just in here to talk about, like, the API. So there’s an evaluate-product and an evaluate-products endpoint; this is for the single evaluation, this is for the batch. Also, like, the forbidden-list endpoints.
248 00:36:30.840 ⇒ 00:36:46.200 Mouhamad: This basically returns, like, the current forbidden list loaded in memory. Of course, in a production system, there would be, like, a database, Postgres, Mongo, etc. Here, it’s just for the demo, so it’s gonna be, like, in memory.
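[Editor’s note: the in-memory forbidden list that the API reads, and that the demo UI can replace by uploading a new banned list, might look like the hypothetical store below; in production it would be backed by Postgres or Mongo as the candidate notes.]

```python
class ForbiddenListStore:
    """In-memory forbidden-list store for the demo; a production system
    would back this with Postgres, Mongo, etc."""

    def __init__(self, initial=None):
        self._items = list(initial or [])

    def get(self):
        """Return the forbidden list currently loaded in memory."""
        return list(self._items)

    def replace(self, new_items):
        """Swap in an uploaded banned list wholesale, as the demo UI does."""
        self._items = list(new_items)
        return len(self._items)
```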
249 00:36:48.110 ⇒ 00:37:04.919 Mouhamad: Everything is tunable, as I said, there’s nothing hard-coded. The way I like to do things is always LEGO-like, so everything can be changed from files, even a person, like, with zero knowledge of, like, coding or anything can just change things, and the code will still work.
250 00:37:05.330 ⇒ 00:37:22.699 Mouhamad: These are the tests, so I’ve done 7 test files across everything, so testing the extraction, the pipeline, the matcher. These are all in the code that I sent. The semantic matcher, the decision, the API, etc. Design principles: why a strong engineering system, not just an LLM app?
251 00:37:22.860 ⇒ 00:37:38.239 Mouhamad: Deterministic where it counts, so AI improves just the extraction and the bits that the code cannot do properly. Graceful degradation, so if AI fails, the deterministic part will still work, so the code will still work, will still produce something. Config over code.
252 00:37:38.460 ⇒ 00:37:46.139 Mouhamad: Core parsing and matching, mostly driven by JSON and prompt files, which keeps, like, iteration easy and everything. Full audit trails, so…
253 00:37:46.310 ⇒ 00:38:01.839 Mouhamad: Negated matches, unverified candidates, matches, audit, all preserved; everything is being audited. Evidence-grounded AI, so AI must provide, like, a source snippet. It doesn’t just blindly claim anything. Full provenance tracking, so every candidate, like, has an origin method.
254 00:38:02.070 ⇒ 00:38:13.449 Mouhamad: etc. Matching: four layers, mostly it goes to the exact one, but there are also, like, fallbacks and fallbacks and fallbacks. Independently tested layers, also; I’ve tested this.
255 00:38:14.290 ⇒ 00:38:33.970 Mouhamad: This is one file, end-to-end. This is how it works. It goes from this, goes to the file extraction, direct read, because it’s text, raw data, deterministic process, section, header, etc, 11 candidates. AI mode, confirm and fill, because the confidence is higher, and we found a couple of matches. Forbidden matching, then decision rejected.
256 00:38:34.150 ⇒ 00:38:37.149 Mouhamad: Now… In here, this is…
257 00:38:38.080 ⇒ 00:38:46.969 Mouhamad: Of course, this also heavily depends on the project, on the data, on what’s the next ask, everything. I just wanted to put some…
258 00:38:47.100 ⇒ 00:38:50.209 Mouhamad: Next step and improvements that can be done on this system.
259 00:38:50.280 ⇒ 00:39:01.649 Mouhamad: Of course, this is debatable in every single case, depending on what the system, what the client has, what everything has, so some things in the near term can be done, like multi-file…
260 00:39:01.650 ⇒ 00:39:12.580 Mouhamad: product evaluation, so one product from multiple supporting documents in a single request. API hardening, so authentication for admin endpoints, remove wildcard CORS, stop returning raw internal errors.
261 00:39:12.700 ⇒ 00:39:17.589 Mouhamad: Operational visibility, so health and status checks for OCR, richer decision output.
262 00:39:17.800 ⇒ 00:39:28.790 Mouhamad: just because, as I said, for now it’s just, like, yes or no. Better throughput, so non-blocking OCR and parallel calls. Or some long-term things, where it needs more work.
263 00:39:29.050 ⇒ 00:39:48.520 Mouhamad: So, stronger semantic matching: better evaluations, a tighter threshold, more curated synonym coverage, reduced false positives. And later, the multi-document and batch workflow, so grouped product submissions, custom forbidden list per request, etc. Evaluation and benchmarking also, so a formal eval
264 00:39:48.620 ⇒ 00:40:04.910 Mouhamad: set for OCR noise, depending on the data, of course, what sort of data we have. Human-in-the-loop review, so a review recommendation path for low confidence. For example, if something is low confidence and it needs, like, an urgent human in the loop, it can send, like, an email or something.
265 00:40:05.130 ⇒ 00:40:17.939 Mouhamad: Deployment hardening, so production authentication, rate limiting. These are next steps and improvements, as I said, depending on the data, depending on everything. Yeah, thank you. Let me just quickly… One quick question.
266 00:40:17.940 ⇒ 00:40:29.039 Samuel Roberts: Before you jump off the… Yes, yes. Yeah, you mentioned a quarantine step, that certain things could be quarantined, and then I didn’t see anything follow that up. I was just… I wasn’t sure where that fit in at the end.
267 00:40:31.350 ⇒ 00:40:32.759 Mouhamad: One second in here.
268 00:40:33.650 ⇒ 00:40:41.159 Mouhamad: So, if something was found by AI, but the snippet was not verified, so, let’s say AI found something.
269 00:40:41.260 ⇒ 00:40:47.230 Mouhamad: But the snippet did not… like, there was no snippet, there was no evidence of this. This is quarantined.
270 00:40:47.610 ⇒ 00:40:49.060 Mouhamad: We don’t trust it.
271 00:40:49.490 ⇒ 00:40:53.480 Samuel Roberts: Okay, okay. I wasn’t sure if that meant something else later in the process or not, but it just…
272 00:40:53.480 ⇒ 00:40:55.469 Mouhamad: No, no, no, this is what I meant, yeah.
273 00:40:55.670 ⇒ 00:40:56.930 Samuel Roberts: Okay, cool, thank you.
274 00:40:57.170 ⇒ 00:41:00.749 Mouhamad: Okay, let me just quickly show you…
275 00:41:01.730 ⇒ 00:41:12.259 Mouhamad: So, this is just a quick UI that I created. Very straightforward in here. So, there is a banned ingredients list, so you can see the full list of these banned items.
276 00:41:12.480 ⇒ 00:41:18.120 Mouhamad: And you can upload, like, a new banned list in here, and it takes it as the new banned list.
277 00:41:18.410 ⇒ 00:41:22.049 Mouhamad: You can, here, for example…
278 00:41:22.570 ⇒ 00:41:29.319 Mouhamad: You can take a couple, for example. This is just PDF, so you can include also yours, and then you can run check.
279 00:41:29.650 ⇒ 00:41:40.230 Mouhamad: First of all, it just runs quickly: this is reading and parsing the file, so I just take the suffix from it, and it detects, oh, this is a PDF, so I need to route it to the appropriate function for it.
280 00:41:40.610 ⇒ 00:41:50.320 Mouhamad: Then it gets the text, and then ingredient extraction runs, identifying ingredients using pattern matching and AI assistance. Then it is matching against the banned list.
281 00:41:50.950 ⇒ 00:41:55.010 Mouhamad: So it’s running the exact, alias, formula, and semantic matching layers.
282 00:41:56.230 ⇒ 00:41:59.069 Mouhamad: Then it produces the results at the end, so…
283 00:41:59.350 ⇒ 00:42:11.060 Mouhamad: And here, for this file, this is the file, the header is the Pro Moisturizer. It was rejected. The reason for that, why was it rejected? Because it contains the same word again.
284 00:42:11.120 ⇒ 00:42:27.139 Mouhamad: Matched ingredients, where it was found. So it also shows you, like, the evidence in here, so it was found using the exact search, so the exact word was, like, a match. And it was found in the document, and this is, like, the snippet where it was found.
285 00:42:27.330 ⇒ 00:42:32.849 Mouhamad: Also, there is a ProLotion on here, where I found this formula in here this time.
286 00:42:33.020 ⇒ 00:42:36.530 Mouhamad: And also, I found it in here with the exact match, and it shows the confidence.
287 00:42:36.710 ⇒ 00:42:38.060 Mouhamad: diff conditioner.
288 00:42:38.360 ⇒ 00:42:51.399 Mouhamad: Also, it found, like, this one in here, and I also found this one in here, so it found two of them. So it can find multiple ingredients where it was rejected for it, and it contains creamer, and it contains sulfate.
289 00:42:51.560 ⇒ 00:42:56.050 Mouhamad: And yeah, this is the exact match. And, of course, you can do, for example.
290 00:42:56.520 ⇒ 00:42:58.999 Mouhamad: Where you can add, like, let’s say…
291 00:42:59.300 ⇒ 00:43:02.399 Mouhamad: one PDF, then you can add…
292 00:43:04.680 ⇒ 00:43:11.580 Mouhamad: Let’s… this is what I created. You can add up to 2 images, then you can add…
293 00:43:11.580 ⇒ 00:43:14.219 Samuel Roberts: Is this in the, the code you sent over, too, this UI?
294 00:43:15.300 ⇒ 00:43:18.319 Mouhamad: Do you… I… no, it’s not. I can send it now as well.
295 00:43:18.320 ⇒ 00:43:20.410 Samuel Roberts: Yeah, that’d be great, just for completeness.
296 00:43:20.410 ⇒ 00:43:21.010 Mouhamad: Yeah, yeah.
297 00:43:21.120 ⇒ 00:43:26.780 Mouhamad: And you can do this, then you can run… Fair.
298 00:43:27.300 ⇒ 00:43:39.600 Mouhamad: Now, for each one, it will see, like, the suffix of the files. You have the PDF in here, PNG, and these two, text, and these two, and it will route each to the appropriate function.
299 00:43:40.040 ⇒ 00:43:45.750 Mouhamad: Then it will identify, like, ingredients using pattern matching, matching against the banned list.
300 00:43:47.470 ⇒ 00:43:49.450 Mouhamad: So I turn on this…
301 00:43:51.840 ⇒ 00:43:59.820 Mouhamad: So, each process will run on each file, so it’s not, like, one batch run; these will run on this, and then this, then this, then this.
302 00:44:03.320 ⇒ 00:44:04.130 Mouhamad: Okay.
303 00:44:04.310 ⇒ 00:44:17.990 Mouhamad: So, this one is correct, it was accepted, no bad ingredient inside of it. This one, ProLotion, as we said before, contains the same ones as before. Pro Conditioner, also, for the PNG, it was accepted. Ultra Conditioner.
304 00:44:18.090 ⇒ 00:44:28.000 Mouhamad: This one was rejected because it contains C6H6, and it shows you, like, where it was contained in here. And the Radiant Shampoo was accepted; no banned ingredients were detected in this product.
305 00:44:28.220 ⇒ 00:44:32.039 Mouhamad: And yeah, this is how… Basically, the system works.
306 00:44:33.680 ⇒ 00:44:34.220 Samuel Roberts: Great.
307 00:44:35.990 ⇒ 00:44:47.079 Samuel Roberts: Can you talk a little bit about the… the process going in? So you, you had a… you had one slide there, talked about the different stages that… that you kind of thought through. I’m curious about the actual,
308 00:44:47.310 ⇒ 00:44:58.080 Samuel Roberts: the planning, the coding process, like, obviously, you know, AI tools, things like that, like, where… how do you… how have you wrapped your mind around using, you know, coding assist tools and things like that?
309 00:44:58.080 ⇒ 00:44:59.630 Mouhamad: Yeah, yeah, sure.
310 00:44:59.880 ⇒ 00:45:09.840 Mouhamad: So, the way I… for example, let’s talk about this project. The way I did it is… I started by reading, like, the README file on the GitHub that you had.
311 00:45:10.250 ⇒ 00:45:27.950 Mouhamad: Then, what I was doing is that I was starting, like, to brainstorm myself before anything. I was like, okay, this one, they need also this. I would, like, read the README properly, etc., etc. So I had, like, an initial thought of, like, the design, how it would be in my mind.
312 00:45:27.990 ⇒ 00:45:31.139 Mouhamad: Then what I would do is that I went to ChatGPT,
313 00:45:31.380 ⇒ 00:45:34.930 Mouhamad: And I started, like, okay, this is the context.
314 00:45:35.440 ⇒ 00:45:42.649 Mouhamad: this is what I have in mind, to do this, to do that threshold, to do this layer, to do that layer, etc, etc, etc.
315 00:45:43.830 ⇒ 00:45:50.009 Mouhamad: correct me if I’m wrong, just don’t agree with me, just tell me if anything is missing, if anything is there.
316 00:45:50.110 ⇒ 00:46:05.950 Mouhamad: then it would give me some stuff, then I would start reading, okay, I would understand, okay, if I’m not happy with something that he said, or I was, like, okay, skeptical about it, I would go back and forth, back and forth, back and forth, so just back and forth between me and him, then I would have, like, an initial design.
317 00:46:06.140 ⇒ 00:46:09.810 Mouhamad: What it does, it is, like, my design plus some corrected stuff from it.
318 00:46:10.020 ⇒ 00:46:18.470 Mouhamad: Then I’ll start coding. So, the way I like to code is not give LLM power to just…
319 00:46:18.870 ⇒ 00:46:30.470 Mouhamad: Okay, this is the architect, just go and edit 300 files. No. I use AI to code, I code it myself, but I use it just like the old days of Stack Overflow.
320 00:46:30.590 ⇒ 00:46:31.530 Mouhamad: So…
321 00:46:31.760 ⇒ 00:46:37.389 Mouhamad: I wanna… I want a code to do this and this, it would give me a code, and I’ll start wiring it in.
322 00:46:37.870 ⇒ 00:46:55.710 Mouhamad: I would use AI for coding in specific scenarios. So, for example, if there is an error inside the logs, for example, because it’s very good at reading, like, a lot of logs faster than me. Second is for test files, because it can create, like, a lot of test files also quickly.
323 00:46:55.930 ⇒ 00:47:04.510 Mouhamad: Also for anything that it needs, like, for example, like… if I forgot, like, a try/except, for example, or…
324 00:47:04.580 ⇒ 00:47:16.549 Mouhamad: or some comments or anything, like, it can also add it. But mostly the design is, like, done the way I described, and the coding is myself.
325 00:47:16.610 ⇒ 00:47:35.359 Mouhamad: Then along the way, with the testing, with everything, like, I would find out, like, some stuff is not working, I would just go back, like, to the vision board, if I want to call it that, and I would start brainstorming more ideas about it, why it’s not working, how can we improve it, etc. And it’s just back and forth, and yeah, till the system is done.
326 00:47:36.870 ⇒ 00:47:38.060 Samuel Roberts: Okay, great, thank you.
327 00:47:39.290 ⇒ 00:47:44.070 Uttam Kumaran: I have to, Sam, I actually have to jump to another call, but I guess, I wanted to ask, like,
328 00:47:44.430 ⇒ 00:47:55.880 Uttam Kumaran: Like, how much… how much of the process was, like, planning up front versus, like, executing? Were there, like, phases where you, like, developed certain pieces, or what was the whole development process?
329 00:47:56.530 ⇒ 00:47:59.780 Mouhamad: Yeah, so I would say it…
330 00:48:00.170 ⇒ 00:48:19.539 Mouhamad: I mean, designing, I cannot give, like, a percentage, because designing, it wasn’t just, like, I designed at the beginning, and then I just stopped, and I started coding, and that’s it. No, designing because, like, when I was coding, and I was, like, figuring out things, and I, like, would, like, run some tests and etc, I would go back to the design.
331 00:48:19.540 ⇒ 00:48:23.609 Mouhamad: And I’ll just, like, alter some stuff, and I’ll tweak some stuff in the design.
332 00:48:23.680 ⇒ 00:48:24.670 Mouhamad: So…
333 00:48:24.850 ⇒ 00:48:31.110 Mouhamad: if you want to call it, like, 50-50, sure, but it’s, like, it’s an iterative process, so I would design.
334 00:48:31.280 ⇒ 00:48:42.559 Mouhamad: I would start code, I would find bugs or errors or some stuff that are not working as I intended, then I would go back to the design, and I would just tweak something, then go, then go, then go. But this is how the process goes.
335 00:48:44.110 ⇒ 00:48:44.770 Samuel Roberts: Okay.
336 00:48:46.240 ⇒ 00:48:47.180 Uttam Kumaran: Cool.
337 00:48:47.180 ⇒ 00:48:48.220 Samuel Roberts: I mean, they haven’t…
338 00:48:48.380 ⇒ 00:48:53.590 Uttam Kumaran: Yeah, go ahead. I might make you a host, but, Mouhamad, it was really, really nice. Thank you, it was really in-depth, I think.
339 00:48:53.590 ⇒ 00:49:09.500 Uttam Kumaran: You’re one of the few candidates that put together an actual presentation, so I was really, really impressed. I think, Sam, yeah, I would… I also wanted to make sure that I get a chance to look at the code, so I don’t know, so that’s something that I want to take some time on that I’ll do, but Sam, maybe I can make you host and have you, you know, close out as you… as you want.
340 00:49:10.270 ⇒ 00:49:11.020 Samuel Roberts: Totally.
341 00:49:11.020 ⇒ 00:49:11.740 Uttam Kumaran: Okay, okay.
342 00:49:12.480 ⇒ 00:49:14.150 Uttam Kumaran: Thank you, Mohamed, it’s really nice to meet you.
343 00:49:14.150 ⇒ 00:49:16.339 Mouhamad: Thank you, Uttam. Thank you. Pleasure meeting you.
344 00:49:16.690 ⇒ 00:49:17.380 Mouhamad: Bye.
345 00:49:17.690 ⇒ 00:49:25.620 Samuel Roberts: All right. Yeah, I mean, so I, like I said, I haven’t dug through the code very much yet, I didn’t have too much time, so, but,
346 00:49:26.030 ⇒ 00:49:31.269 Samuel Roberts: I will do that. If you could definitely send across that.
347 00:49:31.270 ⇒ 00:49:35.519 Mouhamad: I sent the backend code; I can send the UI one next, yeah.
348 00:49:36.090 ⇒ 00:49:45.020 Samuel Roberts: Perfect. And then, yeah, if there’s anything else to know about… I haven’t… I assume there’s, like, a README and how to run and get everything set up.
349 00:49:46.860 ⇒ 00:50:00.709 Mouhamad: There is a README that explains how the code works and everything, and the way to set it up is just, you need the… like, I use OpenAI, so I just use the OpenAI key, that’s it. And the .env. Perfect.
350 00:50:00.710 ⇒ 00:50:05.680 Samuel Roberts: Okay, great. Yeah, do you have any other questions that, I can answer?
351 00:50:05.910 ⇒ 00:50:06.480 Mouhamad: No.
352 00:50:06.480 ⇒ 00:50:11.079 Mouhamad: I think all is clear. It was a pleasure working on this project. It was a fun project.
353 00:50:11.080 ⇒ 00:50:18.909 Samuel Roberts: Okay, cool. Good. Yeah, we tried to put something together that was interesting to work on, and not just a check the boxes kind of thing, but,
354 00:50:18.910 ⇒ 00:50:19.510 Mouhamad: Yeah, yeah, yeah.
355 00:50:19.510 ⇒ 00:50:23.539 Samuel Roberts: together so far, so hopefully I can dig in and be, be happy with it.
356 00:50:23.540 ⇒ 00:50:25.080 Mouhamad: Sure, sure, sure.
357 00:50:25.080 ⇒ 00:50:37.570 Samuel Roberts: Yeah, I know we have, some other candidates kind of in the pipeline, so I don’t know if I can commit time-wise to, like, when you’ll get something back, but, like I think I said before, like, we’re trying to move relatively quickly, we…
358 00:50:37.630 ⇒ 00:50:50.270 Samuel Roberts: I don’t want to drag things out, and what are we at today? Tuesday? So, yeah, maybe… maybe end of the week, but don’t quote me on that, just in case it’s, I’m not sure what other things are in the pipeline, 100%, but.
359 00:50:50.270 ⇒ 00:50:54.319 Mouhamad: No, no, I completely get it. I know how schedule can get hectic, no, no, I get it.
360 00:50:54.320 ⇒ 00:51:06.849 Samuel Roberts: Totally, totally. I really appreciate that. But, yeah, if we have any other questions, we’ll reach out, once I get through the code, hopefully things make sense, and if not, I’ll ping you or something, but, thank you so much for the time.
361 00:51:06.850 ⇒ 00:51:09.100 Mouhamad: Will do. Thank you so much, Sam, thank you for your time.
362 00:51:09.100 ⇒ 00:51:10.410 Samuel Roberts: Yeah, have a good one.
363 00:51:10.640 ⇒ 00:51:12.100 Samuel Roberts: Bye. Yeah. Bye-bye.