Meeting Title: Activity-Schema-Overview Date: 2024-03-13 Meeting participants: Agustin Bergoglio, Uttam Kumaran, Bryce, Ryan Luke Daque, Patrick Trainer


WEBVTT

1 00:00:04.711 00:00:05.551 bryce: and

2 00:00:08.631 00:00:14.191 bryce: it seems more promising. Alright! Can you all see my screen.

3 00:00:14.331 00:00:15.531 Patrick Trainer: Yep, yep.

4 00:00:15.761 00:00:30.920 bryce: alright slick, so I have not rehearsed this at all. I threw this together very haphazardly, so we’ll see how this goes. But yeah, so this is a talk about a concept that I’m calling event-centric data modeling.

5 00:00:31.091 00:00:34.261 bryce: So at a high level, what we’ll talk about today is

6 00:00:34.431 00:00:38.411 bryce: what is this weird, nebulous term

7 00:00:38.461 00:01:02.210 bryce: that maybe other people have used and maybe not. It's not actually that original under the hood. Where does this modeling methodology fit into the general data transformation workflow? Like, where does the concept of activity schema fit into all of this blabbering I'm about to do? And then how does all this tie back into what we know today, which is like

8 00:01:02.401 00:01:09.300 bryce: dbt for data transformation, BI tools for semantics, visualization, aggregation, etc.

9 00:01:09.421 00:01:14.871 bryce: But first, humor me, you know, as I make an observation based on my

10 00:01:14.941 00:01:17.931 bryce: years of experience working as a

11 00:01:18.181 00:01:34.960 bryce: years of experience working as a data generalist. So at our core, if we really look at our workflows, the single most common thing that we analytics engineers, data engineers, data analysts do is we're building denormalized data sets, where data sets are

12 00:01:34.961 00:01:49.310 bryce: persistent tables that serve as the basic building blocks for defining aggregated metrics and charts that we then use for analysis and reporting in BI tools and notebooks, etc.

13 00:01:49.311 00:01:57.480 bryce: And so, how do we get to the point where we build data sets? Well, we usually build data pipelines, right, where the pipelines

14 00:01:57.511 00:01:59.940 bryce: usually look something like this, where we have

15 00:02:00.031 00:02:25.071 bryce: data sources that come from SaaS tools versus transactional databases, etc. We replicate them over to our data warehouse. We have this transformation layer that's usually defined in dbt, and then it spits out these artifacts, these data sets, that we categorize as marts, so like subsets of data sets that are related to each other, that serve

16 00:02:25.380 00:02:37.040 bryce: a common use case, like marketing reporting, or sales reporting, or product analytics reporting, so on and so forth. So when it comes to

17 00:02:37.381 00:02:39.190 bryce: answering a new question.

18 00:02:39.571 00:03:02.831 bryce: the intent of this data pipeline design is to start with the marts. But usually that does not happen, right? At least in my experience. Maybe this is just because I worked with people who are bad at creating data marts and the denormalized data sets that are available to be used for broad use cases. Looking at you, Uttam.

19 00:03:02.831 00:03:13.440 bryce: That's a joke, Uttam is very good at this, as we all know. Like, when we wanna answer a new question, myself included, the most common

20 00:03:13.441 00:03:22.440 bryce: pattern that I see is everybody goes back to the source data and then builds their own mart, like a brand new mart, to answer their new question from there.

21 00:03:22.781 00:03:23.941 bryce: So

22 00:03:23.951 00:03:38.140 bryce: So why is that the case? Right? In my opinion, it's because oftentimes data marts are designed for very specific use cases. So there's predefined granularity and dimensions baked into the data marts that are hard to

23 00:03:38.401 00:03:58.331 bryce: work around without affecting somebody's existing reporting modules or analysis modules. So we don't want to break what already exists. So we have 2 options. We can either go up a layer into the transformation layer and try to extend the existing data mart, or we can just build a new data mart from scratch. But

24 00:03:58.431 00:04:09.550 bryce: so let's say we want to go back into the transformation layer to extend the existing data mart. When we actually look under the hood, it looks like

25 00:04:09.761 00:04:36.221 bryce: fuck-all, like patchwork code defined at arbitrary levels of granularity, with business-logic-based SQL filters baked in, so it can only be used for one specific use case. And as somebody who has tried to do this before, I look at this code now, and as not the code writer, I don't know how to leverage these intermediate transformation steps and

26 00:04:36.421 00:04:52.201 bryce: use them to extend an existing data mart quickly. So what am I going to do? I'm just gonna go back to square one and start with the source data, because I know roughly how that's structured, and then just build everything from scratch and have more control over the outcome.

27 00:04:52.481 00:04:54.051 bryce: So

28 00:04:54.141 00:05:09.141 bryce: all that said, my proposal is we swap out these hodgepodge approaches to intermediate transformations with a standard that I'm calling event-centric data modeling.

29 00:05:09.251 00:05:10.931 bryce: So what is that?

30 00:05:11.011 00:05:37.870 bryce: It's a super lame definition, but it is the practice of building event models, which I'll describe in a second, as an intermediate transformation standard in the data warehouse, which then serves as the common entry point for all downstream reporting, analysis, and data set derivation work. So for anyone who's worked on the software application side of the tech stack,

31 00:05:38.191 00:05:40.361 bryce: think of this

32 00:05:40.411 00:05:47.221 bryce: paradigm as event sourcing after the data gets to the data warehouse.

33 00:05:47.691 00:05:49.371 bryce: so like.

34 00:05:49.551 00:05:58.751 bryce: I basically defined event-centric data modeling with a reference to itself in the definition. So we should talk about: what is an event model?

35 00:05:58.781 00:06:15.000 bryce: So it is an immutable data set, like a table that has immutable business logic. So the logic does not change: if we wanna drop and recreate the table, all the rows and values should be exactly the same as at the point in time at which we ran it.

36 00:06:15.491 00:06:33.461 bryce: Each model is like a single table that represents a key step in a key business process that the company collectively understands and is interested in observing and monitoring. So where each row represents an instance like an occurrence of that step

37 00:06:33.561 00:06:43.801 bryce: In practice, this really isn't that different from fact tables, or SCD Type 2 tables in a star schema. But that's a little bit of a tangent.

38 00:06:43.921 00:06:47.161 bryce: But at its core, this model is

39 00:06:47.281 00:06:52.881 bryce: super, super simple, because it really only requires 3 columns, which are

40 00:06:53.111 00:07:21.870 bryce: an event name, which gives some high-level context of what the event, like what the table, represents. So it could be 'prospect entered sales funnel', or 'sales rep emailed prospect', or 'customer placed order'. So that would be the event name. Then the entity ID. So this is the core business unit, or actor, that we are interested in,

41 00:07:21.921 00:07:43.920 bryce: that we wanna analyze, and that is engaging in or affected by the event in question. So in theory, more than one entity could be associated with a single event, right? So I gave the example of 'sales rep emailed prospect', so we could have at least 2 different entities

42 00:07:44.291 00:07:57.251 bryce: on that event. The prospect is one entity, and then if we wanna do sales operations analytics, the sales rep could also be another entity.

43 00:07:57.331 00:08:15.531 bryce: And then the last, but hands down most important, attribute on these event models is the timestamp: the timestamp at which each event instance occurs. So think of this as an intermediate transformation standard where you just put timestamps on fricking everything.
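
[Editor's sketch of what one of these event models could look like in SQL; the source table raw_orders and its column names are hypothetical stand-ins, not from the talk:]

    -- The three required columns of an event model,
    -- derived from a hypothetical raw_orders source table.
    select
        'customer placed order' as event_name,      -- high-level context for what the table represents
        customer_id             as entity_id,       -- the core actor engaging in the event
        ordered_at              as event_timestamp  -- when this event instance occurred
    from raw_orders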

44 00:08:15.631 00:08:38.391 bryce: Then if you wanna get fancy, and we will get fancy in a moment, there are additional optional columns. We can define a unique ID, which could be a surrogate key or a natural key that comes from the system that generates the event in question. And then we can have event-specific attributes. But remember, this is immutable. So the attributes that are specific to the event

45 00:08:38.721 00:08:43.030 bryce: have to be known at the time that the event occurs. So think about like

46 00:08:43.101 00:08:46.561 bryce: an event called 'customer placed order'.

47 00:08:46.811 00:09:04.450 bryce: We'll know the total cost, which is a sum of all the line items, the taxes, etc. So we can include that as an attribute on that event. But total cost wouldn't apply to an event like 'prospect entered funnel'; this attribute is specific to the 'customer placed order' event.

48 00:09:04.481 00:09:08.901 bryce: however, we couldn’t have an attribute like

49 00:09:08.941 00:09:23.610 bryce: total return value on the 'customer placed order' event, because if we think about the logical sequence of events, returns on an order happen after the order is placed, so we don't know any information about the returns at the point at which the order is placed.
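
[Editor's sketch making the immutability constraint concrete: the same hypothetical event model with the optional columns added; all names are illustrative:]

    -- Optional columns: a unique ID and event-specific attributes
    -- that are knowable at the time the event occurs.
    select
        order_id                as event_id,         -- natural key from the source system
        'customer placed order' as event_name,
        customer_id             as entity_id,
        ordered_at              as event_timestamp,
        total_cost                                   -- known when the order is placed
        -- no total_return_value column: returns happen after the order is
        -- placed, so that value is unknowable at this event's timestamp
    from raw_orders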

50 00:09:23.921 00:09:26.701 bryce: cool. So

51 00:09:26.751 00:09:46.211 bryce: A couple of other clarifying questions. Why would anybody want to use an intermediate transformation standard? In short, it makes teams work faster together. When there's a standard in place, there's consistent contribution styles across team members. So certain

52 00:09:46.211 00:09:59.901 bryce: seemingly trivial design decisions are more guaranteed to happen in a standard way, and it makes it easier for team members to collaborate and contribute to code that they maybe didn't originally write. So now,

53 00:10:00.501 00:10:09.371 bryce: continuing to use Uttam as the example, if I want to extend the data mart that Uttam built, and Uttam built

54 00:10:09.431 00:10:18.771 bryce: an event model called 'customer placed order', it's very easy for me to go and grab that information, because I know exactly how it should be structured and what information should be contained in it.

55 00:10:18.861 00:10:21.581 bryce: And so on and so forth.

56 00:10:21.791 00:10:48.440 bryce: So yeah, going beyond just the more philosophical question of why have an intermediate standard in the first place: why is event-centric modeling that standard? The standard schema makes for really consistent and predictable transformation patterns to go from individual event models into a downstream data set that we use

57 00:10:48.921 00:11:14.330 bryce: for analysis and reporting purposes. I like it from a business logic standpoint, because it prompts us as developers to think about breaking our code up into small and logical chunks. It turns SQL queries into much more digestible 30-to-50-line snippets, as opposed to 250-line behemoths that nobody can grok inside of

58 00:11:14.430 00:11:25.200 bryce: the pane of a single code view or file. And then from a requirements-gathering standpoint, it just makes things dead simple, like

59 00:11:25.271 00:11:39.260 bryce: we go to our business stakeholders or functional team members, and we say, what is your business process? They list a set of steps, and we take those steps in the process, and they map one-to-one to new event models that we need to create.

60 00:11:39.321 00:11:43.581 bryce: And we know exactly what columns we need to add.

61 00:11:43.921 00:12:02.931 bryce: And then on top of it, because of all these benefits that I described, it makes it much easier to extend existing data marts, because now we can just add a new data set from our preferred set of event models. And then with the data set that we create,

62 00:12:03.031 00:12:09.661 bryce: we can define joins to the existing models in the data mart, in our BI tool, or in

63 00:12:09.871 00:12:12.821 bryce: dbt, or whatever you want.

64 00:12:12.841 00:12:22.201 bryce: So the next piece of the puzzle is, how does activity schema fit into all of this?

65 00:12:22.311 00:12:47.350 bryce: And activity schema is just an opinionated implementation of event-centric modeling. It just has some extra enrichment columns applied, and it asserts some table clustering or partitioning to ensure efficient querying of denormalized data sets. This paradigm also uses the term

66 00:12:47.351 00:12:58.421 bryce: activity instead of event, and while some people who are hardcore purists would say that events and activities are strikingly different, for the sake of

67 00:12:58.641 00:13:03.631 bryce: simplicity, like, they are more or less the same thing.

68 00:13:03.661 00:13:15.140 bryce: So, all that being said, activity schema is just an opinionated approach to event-centric modeling, where event-centric modeling is kind of the umbrella concept.

69 00:13:15.351 00:13:28.231 bryce: So if we’re gonna use activity schema as our preferred mode of event-centric modeling to do the core thing that we want to do over and over again, which is build new data sets to answer new questions

70 00:13:28.361 00:13:55.051 bryce: then first, I should talk about a few key concepts related to data set building, and then I'll talk through the core workflows. So the first concept is the entity; we've talked about that before. That could be a customer or a user. The next concept is the primary activity. This represents the event that will define the granularity of the data set. So if we want to do an analysis around

71 00:13:55.051 00:14:06.581 bryce: orders, then we would probably want to select the 'customer placed order' activity as our primary activity, and then build around that.

72 00:14:06.751 00:14:26.130 bryce: A joined activity is an event that links back to the primary activity, using a combination of the entity ID and the timestamps on each of the two, the joined activity and the primary activity. So on the concept of timestamps and joins, there is this

73 00:14:26.341 00:14:28.321 bryce: like? Very

74 00:14:28.561 00:14:37.260 bryce: and, I guess, unsophisticatedly named concept of a temporal join, which is basically a set of

75 00:14:37.471 00:14:52.750 bryce: timestamp-based join rules or requirements that are used for linking a joined activity back to the primary activity. So we could say we want to append the first-ever

76 00:14:52.801 00:14:55.161 bryce: customer, created account.

77 00:14:55.331 00:15:07.151 bryce: events to each of the 'customer placed order' events, and that's going to have a certain set of timestamp criteria predefined in the join. And then last, but

78 00:15:07.321 00:15:28.130 bryce: certainly not least, there's the concept of the dataset column. This is a dimension or a measure in the data set, derived from either picking an attribute from the primary activity, or picking an attribute from a joined activity and then applying the appropriate aggregation to it, if aggregation is needed.
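
[Editor's sketch of what that temporal join could look like in SQL, appending the first-ever 'customer created account' event to each 'customer placed order' event; the event model table names are hypothetical:]

    -- Temporal join: link a joined activity back to the primary activity
    -- on entity_id plus timestamp criteria.
    select
        o.entity_id,
        o.event_timestamp      as ordered_at,
        min(a.event_timestamp) as first_account_created_at  -- "first ever" = earliest occurrence
    from customer_placed_order as o
    left join customer_created_account as a
        on a.entity_id = o.entity_id
        and a.event_timestamp <= o.event_timestamp  -- only account events at or before the order
    group by 1, 2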

79 00:15:28.131 00:15:56.801 bryce: So in practice, how do we actually take these key concepts and apply them in the workflow of building a data set in the realm of activity schema? So the first thing we do is we pick our entity; let's build with customers. Then we pick a primary activity. Let's say that we wanna analyze orders; we pick 'customer placed order'. That's gonna be our granularity of the data set. Then we pick primary attributes to include in the data set.

80 00:15:56.861 00:16:07.341 bryce: So that could be the order total, for example, or we could pick the timestamp, so that we know the timestamp tied to the order records.

81 00:16:07.461 00:16:20.261 bryce: And then, so long as you have more dataset columns that you want to add to your data set, you pick new activities to join, you define the temporal join requirements, and then you pick attributes from that

82 00:16:20.281 00:16:23.070 bryce: temporally joined activity that you're linking back,

83 00:16:23.121 00:16:33.381 bryce: apply aggregations, sums, means, medians, etc., and then they render as columns in the data set. Then you have your data set.
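
[Editor's sketch putting the whole workflow together: entity = customer, primary activity = 'customer placed order'; the customer_returned_item event model and its columns are hypothetical:]

    -- Build a data set: primary-activity attributes plus an aggregated
    -- attribute from a temporally joined activity.
    select
        o.entity_id          as customer_id,
        o.event_timestamp    as ordered_at,          -- primary attribute
        o.total_cost,                                -- primary attribute
        sum(r.return_value)  as total_return_value   -- aggregated joined attribute
    from customer_placed_order as o
    left join customer_returned_item as r
        on r.entity_id = o.entity_id
        and r.event_timestamp > o.event_timestamp    -- temporal join: returns happen after the order
    group by 1, 2, 3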

84 00:16:33.431 00:16:47.290 bryce: So last, but certainly not least, how do these kind of high-level concepts tie into workflows with tools that we're familiar with? So specifically dbt, and the business intelligence layer, like Dash,

85 00:16:47.511 00:16:51.771 bryce: evidence, real Booker, etc.

86 00:16:51.851 00:16:52.911 bryce: So

87 00:16:53.061 00:17:13.650 bryce: In my opinion, and the way that I've been doing this thus far, all event-centric data modeling and the most granular dataset derivation should happen in dbt. So that's creating our event models, and then creating our data sets that query directly from events.

88 00:17:13.661 00:17:35.990 bryce: And then when it comes to BI, we take those granular data sets and expose them as views on a one-to-one basis. And then, for people who want to build fully fledged marts, they can also build explores and link individual views together with known, predefined join requirements on primary keys and stuff like that.
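
[Editor's sketch of how that division of labor might be laid out in a dbt project; directory and model names are purely illustrative, not from the talk:]

    models/
      events/                          -- one model per event (the event-centric layer)
        customer_placed_order.sql
        customer_created_account.sql
        customer_returned_item.sql
      datasets/                        -- granular data sets that query directly from events
        orders_dataset.sql             -- exposed one-to-one as a view in the BI layer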

89 00:17:36.241 00:17:52.230 bryce: The reason why I like this workflow is it leverages each of these tools for what they're best at. So in dbt, that is defining and running data pipeline code with known dependencies. And for the BI layer, it is

90 00:17:52.231 00:18:12.951 bryce: aggregating and visualizing pre-materialized data sets. I've been doing this for like 2 years at a company called Ashby. I'm a full-time employee there; I am a team of one. I was hired when there were 30 people, and now there are over a hundred, and we have no

91 00:18:13.001 00:18:18.040 bryce: foreseeable need to hire anybody else, because this workflow has made it

92 00:18:18.251 00:18:26.951 bryce: much, much faster for me as an IC to spin up data sets to answer questions on the fly,

93 00:18:27.141 00:18:49.120 bryce: in a rapid manner. So, anyway, that's my quick spiel. I suppose I should probably give myself a shameless plug. I have an open source dbt package that does a semi-opinionated implementation of activity schema modeling in dbt. It offers some nice scaffolding and guardrails when it comes to

94 00:18:49.281 00:19:05.201 bryce: implementing this exact kind of workflow. I have a lot of vision for future things that it might do, but in the near term, this is exactly the workflow that I use at my day job. And they like me a lot there. So yeah, definitely, strongly recommend it.

95 00:19:06.021 00:19:15.441 Uttam Kumaran: Nice. No, this is great. I guess I had a question about all the attribute data. Like, when we have an order for one of our clients, it comes with so much

96 00:19:15.901 00:19:19.771 Uttam Kumaran: like attribute data that is not related to

97 00:19:20.701 00:19:29.510 Uttam Kumaran: just the timestamp of the order. Like, there's a process status, there's all sorts of different other dates.

98 00:19:29.541 00:19:35.341 Uttam Kumaran: There's line items, you know, there's hierarchy. Like, how do you deal with all that

99 00:19:35.401 00:19:37.380 Uttam Kumaran: stuff that comes in.

100 00:19:37.531 00:19:54.411 bryce: So things that we would think of as slowly changing dimensions are activities in their own right, right? So when you have something like an order status, I would just create an event or an activity called 'order changed status',

101 00:19:54.611 00:20:02.260 bryce: yeah, or 'customer changed order status'. You can come up with the naming standards as you see fit.

102 00:20:02.701 00:20:25.581 Uttam Kumaran: And then, so everything comes in as the event. And then, for example, one of the things we do is we calculate a total profit, which takes in the product cost from one area, takes in how much we sold it for from Shopify, takes in the shipping data from ShipStation. And all those are linked

103 00:20:25.851 00:20:27.861 Uttam Kumaran: through some joins.

104 00:20:28.051 00:20:35.080 Uttam Kumaran: But so there's those events that come in associated with an order. But then there's also these aggregated

105 00:20:35.271 00:20:39.421 Uttam Kumaran: sums that we're doing for the calculations. Like, where do those live?

106 00:20:39.581 00:21:02.620 bryce: Yeah. So they would be their own events or activities as well. And then, one thing, not getting too deep into the technical implementation, is you can add extra join keys onto the temporal join criteria. So if you wanted to create reporting at, like, if you have a customer stream, but you wanted the

107 00:21:02.651 00:21:04.681 bryce: column data to be, like,

108 00:21:05.041 00:21:25.861 bryce: rolled up to the order level, for example. If you wanted to basically create a fact orders table off of your customer stream, you could have your placed order events as your primary activity, and then link all the other order events that occur (returns, shipping status updates, etc.)

109 00:21:25.861 00:21:50.501 bryce: back to the primary event on both the customer ID and the order ID, and then define your attributes as, like, the sum of the price, or whatever, to get your values. And then what's nice there is you get the building blocks of individual calculations that then need to be summed up to create your aggregate-level metrics, like

110 00:21:50.501 00:22:06.280 bryce: profit and revenue and operating costs. And so what happens under the hood, implicitly, and this is starting to get more into the realm of where I really see this going, is you kinda get a metrics tree for free

111 00:22:06.331 00:22:31.830 bryce: by adhering to this data modeling standard. And then, when that is available, there are a lot of really fun things that you can do in the analysis and reporting world that just get cascaded down automatically, because the data transformation steps are broken up into these well-defined, small steps like this.
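
[Editor's sketch of the fact-orders-style rollup described above, linking other order events back to the primary 'customer placed order' event on both the customer ID and an extra order ID join key; the customer_returned_item event model and its columns are hypothetical:]

    -- Roll the customer stream up to the order level with an extra join key.
    select
        o.order_id,
        o.entity_id          as customer_id,
        o.total_cost,
        sum(r.return_value)  as total_return_value
    from customer_placed_order as o
    left join customer_returned_item as r
        on r.entity_id = o.entity_id
        and r.order_id = o.order_id                  -- extra join key beyond the entity
        and r.event_timestamp >= o.event_timestamp   -- temporal criteria: returns after the order
    group by 1, 2, 3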

112 00:22:32.111 00:22:32.941 Uttam Kumaran: Okay.

113 00:22:33.531 00:22:34.351 bryce: yeah.

114 00:22:34.821 00:22:42.981 bryce: But yeah, I think these are good questions, and a clear indication to me that a logical follow-up

115 00:22:43.041 00:22:46.370 bryce: here would be

116 00:22:47.251 00:22:52.610 bryce: walking through an actual example. So, like,

117 00:22:52.931 00:23:05.571 Uttam Kumaran: Let's come up with a basic example and do that. Because, yeah, we have a ton of data, and we can easily come up with a quick example. And then, I mean, I would love to even talk about how we

118 00:23:05.581 00:23:13.000 Uttam Kumaran: could try to use this, especially since this is most helpful when we bring on a client initially.

119 00:23:13.161 00:23:27.211 Uttam Kumaran: And then again, you're exactly right. Our goal here is, like, we have a lot of people working on many parts of the stack for every client, and the modeling layer is one of the things that takes the most to kind of absorb and understand.

120 00:23:27.411 00:23:34.290 Uttam Kumaran: And then, similarly, we're trying to do stuff and play around with BI-as-code platforms,

121 00:23:34.341 00:23:37.140 Uttam Kumaran: because it’s way faster to do the development.

122 00:23:37.171 00:23:42.871 Uttam Kumaran: So like constantly thinking about, like how to improve the speed of these things.

123 00:23:43.661 00:23:46.831 bryce: Yeah, yeah, this like this.

124 00:23:47.081 00:24:00.951 bryce: modeling paradigm, I think, will play really, really nicely with BI-as-code tooling, 'cause you probably just know the metrics, and then you could get a ton of stuff templatized, like if you

125 00:24:00.951 00:24:21.650 bryce: template the queries with a metric name, or the activity name, or whatever. Yeah. And that's the other nice thing, too: people come from different source data, but different businesses tend to coalesce around similar concepts of key business processes, and the steps in those processes. So, like,

126 00:24:22.381 00:24:38.371 bryce: most B2B SaaS companies will have a very similar looking set of steps or events in their sales funnel, for example. And so whether their sales data originates from HubSpot or Salesforce, these abstractions

127 00:24:38.461 00:24:47.471 bryce: apply and map in a pretty consistent fashion. People just need to define their mapping from, like, Salesforce opportunity data to this

128 00:24:47.501 00:24:55.000 bryce: shared abstraction one time, or HubSpot to the abstraction one time, and then it just applies across a lot of different businesses.
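
[Editor's sketch of what that one-time mapping might look like, landing source-specific sales data on one shared event abstraction; the source tables and columns are hypothetical stand-ins for Salesforce and HubSpot data:]

    -- Map two different sources onto the same event abstraction.
    select
        'prospect entered sales funnel' as event_name,
        contact_id                      as entity_id,
        created_date                    as event_timestamp
    from salesforce_opportunities

    union all

    select
        'prospect entered sales funnel' as event_name,
        contact_id                      as entity_id,
        create_date                     as event_timestamp
    from hubspot_deals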

129 00:24:57.731 00:24:58.511 bryce: Yeah.

130 00:24:59.451 00:25:02.060 Uttam Kumaran: Okay, cool. Anyone else have questions?

131 00:25:04.071 00:25:12.180 Ryan Luke Daque: I guess I'm just wrapping my head around it. I can't quite imagine it yet. But maybe, like, if we have an example,

132 00:25:12.191 00:25:15.561 Ryan Luke Daque: that would probably be easier to understand

133 00:25:15.841 00:25:18.990 Ryan Luke Daque: than, like, imagining what it would look like. Yeah.

134 00:25:19.171 00:25:20.741 Patrick Trainer: Is there

135 00:25:21.021 00:25:26.720 Patrick Trainer: like a relational table of all of the events used? Or are

136 00:25:26.771 00:25:28.791 Patrick Trainer: events like defined

137 00:25:30.171 00:25:32.401 Patrick Trainer: on like a per table basis?

138 00:25:33.671 00:25:53.081 bryce: So my recommendation is, yeah, there isn't a predefined definition of tables that exist, because the logic lives in code. My recommendation, and honestly the way that my dbt package works, is to define one dbt model per event. So, like,

139 00:25:53.121 00:26:10.580 bryce: in theory, some of this data comes in really nicely, like clickstream data from Segment, RudderStack, etc. You could use the track table and get the event name, and have one select statement that creates dozens, if not hundreds, of different events.

140 00:26:10.711 00:26:15.051 bryce: But that said,

141 00:26:15.121 00:26:27.321 bryce: my personal opinion is to take the time to be explicit and specify the set of events that you really want to use. And if you want a generic

142 00:26:27.471 00:26:37.371 bryce: 'track impression' event, you can do that as well, and there are ways to trivially subclass events when you're creating data sets. But

143 00:26:37.441 00:26:43.320 bryce: yeah, as a starting point, I recommend, in dbt land, one dbt model per

144 00:26:43.341 00:27:02.361 bryce: uniquely namespaced event. And then the nice thing about dbt is that the dbt manifest graph has stateful knowledge of all of the events, in all the streams that exist in the project, and how they all relate to each other.
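
[Editor's sketch of the one-dbt-model-per-event convention, where downstream data sets declare their event dependencies through dbt's ref() so the manifest graph knows how everything relates; the source and model names are hypothetical:]

    -- models/events/customer_placed_order.sql: one dbt model per event
    select
        'customer placed order' as event_name,
        customer_id             as entity_id,
        ordered_at              as event_timestamp
    from {{ source('shop', 'raw_orders') }}

    -- models/datasets/orders_dataset.sql: queries directly from event models
    select *
    from {{ ref('customer_placed_order') }}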

145 00:27:02.461 00:27:07.280 Patrick Trainer: Okay, so in that case, it makes it pretty deterministic then.

146 00:27:07.371 00:27:08.861 bryce: yep, exactly

147 00:27:10.291 00:27:11.201 Patrick Trainer: got it

148 00:27:12.661 00:27:13.421 bryce: cool.

149 00:27:15.041 00:27:20.740 Ryan Luke Daque: Is this the GitHub pack, I mean, the dbt package that you're referring to? I just sent

150 00:27:22.421 00:27:26.950 Ryan Luke Daque: let me. Yeah, that is it.

151 00:27:28.751 00:27:29.581 bryce: Yeah.

152 00:27:30.151 00:27:48.170 bryce: The documentation is long and preachy, and not very tutorial-friendly; it just has all the different concepts documented out. At some point in my life, I'll create a worked example in a separate repo that I think will be easier for people to grok. But yeah,

153 00:27:48.381 00:27:50.620 bryce: in the near term I’d be happy to like

154 00:27:50.831 00:27:59.790 bryce: link up, maybe with you, Uttam; we can tag-team a worked example. And then I'd be happy to walk through it with the whole team,

155 00:27:59.991 00:28:01.230 Uttam Kumaran: Yeah, okay.

156 00:28:01.281 00:28:02.551 bryce: see if it’s useful.

157 00:28:03.841 00:28:05.171 Uttam Kumaran: Okay, perfect.

158 00:28:06.551 00:28:14.501 Uttam Kumaran: Okay, so maybe me and you can catch up sometime. And then also, I can send you this recording, too, if you need it for your presentation as well.

159 00:28:14.531 00:28:19.240 bryce: Yeah, that would be great. I think the most obvious piece of feedback is like.

160 00:28:19.591 00:28:33.220 bryce: I'm very much of the mind that people need to be won over to the idea of having an intermediate transformation standard. But it's clear that, without

161 00:28:33.351 00:28:44.480 bryce: a concrete example to talk through, it can be hard to, it can be...

162 00:28:45.000 00:28:56.910 Uttam Kumaran: Yeah. And so definitely one thing, I think, is: if you're starting from scratch, this is great. But I think for the most part you're gonna be getting people who are migrating. That's also a piece, like maybe a good slide or something on,

163 00:28:57.511 00:29:04.811 Uttam Kumaran: hey, I'm aware that this is something that requires a bunch of changes, and here's how you would go from a normal,

164 00:29:04.991 00:29:17.800 Uttam Kumaran: kind of cluttered environment to this. And that could even be a good thing to say more about for the talk: like, if you were to do this today, what are the key steps? And yeah.

165 00:29:18.451 00:29:20.070 Patrick Trainer: yeah, that. G, ptt.

166 00:29:21.021 00:29:21.921 Uttam Kumaran: yeah.

167 00:29:22.071 00:29:30.921 bryce: Truly, yeah. But also, I have some thoughts there. Like, if a company has dbt in place already,

168 00:29:31.841 00:29:50.360 bryce: I don't think it's too difficult to skunkworks their own side project in, like as just a separate subdirectory, and then go from there. But that's another good piece of feedback; I can certainly work that

169 00:29:50.721 00:29:54.861 bryce: into the content of this talk. Yeah.

170 00:29:56.261 00:30:01.800 Uttam Kumaran: Okay, cool. Let me send that to you. And then, yeah, let's catch up sometime next week, maybe.

171 00:30:02.171 00:30:12.990 bryce: Yeah, thank you all for humoring me while I talk through this. And hopefully, I will be back in the not too distant future with a more concrete example to walk through.

172 00:30:13.331 00:30:15.281 Agustin Bergoglio: Nice. Thanks, Bryce.

173 00:30:15.421 00:30:19.610 Ryan Luke Daque: It's really nice to have sharing sessions like this. Yeah.

174 00:30:19.811 00:30:21.640 bryce: Yeah, for sure, I appreciate it.

175 00:30:21.871 00:30:22.920 Patrick Trainer: appreciate it.

176 00:30:23.201 00:30:26.141 Uttam Kumaran: Nice, Bryce.