Analytical processing problem - olap
This is a discussion on Analytical processing problem - olap ; We collect TV viewership data to provide details of ‘who watched what’ to advertising agencies. A typical report would show the distribution of the number of people who watched a given advertisement. The report will be further sliced into channels, ...
![]() |
| | LinkBack | Thread Tools | Display Modes |
|
#1
| |||
| |||
| to advertising agencies. A typical report would show the distribution of the number of people who watched a given advertisement. The report will be further sliced into channels, categories of viewers, etc. We were thinking of OLAP as the solution as it keeps the data pre- processed. The problem we face is the following: User A watches from T0 to T20 User B watches from T0 to T10 User C watches from T10 to T20 If a single fact record represents 10 using of the time, A report for T0 to T10 will show 2 users A report for T10 to T20 shows 2 users A report for T0 to T20 would show an aggregated count of 2 + 2 = 4 which is wrong. This should ideally be 3 as 3 distinct users have watched between T0 and T20 Apart from this, the number of fact records created on a per 10 second timeframe would be very large. What would be an ideal solution to this problem? Is there a different way we should be creating fact data? Is there a different approach other than OLAP? The raw logs have the fields: UserID, advt ID, start time, end time. The records represent the time frame for which the advertisement was being watched without switching the channel. We were thinking of extracting out of this to create the fact records. Are there existing products that do something similar? Thanks, Yash |
|
#2
| |||
| |||
|
Responding to Yash... > We collect TV viewership data to provide details of ‘who watched what’ > to advertising agencies. A typical report would show the distribution > of the number of people who watched a given advertisement. The report > will be further sliced into channels, categories of viewers, etc. > We were thinking of OLAP as the solution as it keeps the data pre- > processed. > > The problem we face is the following: > User A watches from T0 to T20 > User B watches from T0 to T10 > User C watches from T10 to T20 > If a single fact record represents 10 using of the time, > A report for T0 to T10 will show 2 users > A report for T10 to T20 shows 2 users > A report for T0 to T20 would show an aggregated count of 2 + 2 = 4 > which is wrong. This should ideally be 3 as 3 distinct users have > watched between T0 and T20 > Apart from this, the number of fact records created on a per 10 second > timeframe would be very large. I think this is a basic design problem rather than a pattern problem. I also suspect that optimization will be a bigger problem that simply ensuring the viewers in the count are unique. [Advertisement] + adID | 1 | | R1 | | presented in | * [TimeFrame] + startTime + endTime | * | views in | R2 |--------- [LoginRecord] | | * [Viewer] + viewerID Basically the R2 association is managed with a sorted lookup table that maps viewers to time slices. So to get the count of unique viewers for any given advertisement, one just navigates the R1 -> R2 collections. Note that there are relatively few [TimeFrame] instances (8640 for 24 hrs). R1 is instantiated from knowing when the ads play. The raw records are only used to instantiate entries in the R2 lookup table. But that table is, at worst, O(number of viewers x number of time frames) in size. One can also play implementation games by making viewerID consecutive integers and encoding the times as integer codes to reduce the lookup table's physical size in memory. If there is a very large number of viewers (i.e., the log source is not a sampling), then you might want to preprocess by creating the lookup table temporarily in the DB from the raw data and then reading it in sequentially to actually count the viewers per ad. > > What would be an ideal solution to this problem? Is there a different > way we should be creating fact data? Is there a different approach > other than OLAP? > The raw logs have the fields: > UserID, advt ID, start time, end time. This seems strange. I would expect the raw data to be viewer logs without any adID because you would probably know when each ad played from other sources. IOW, adID may be a 3NF violation because it is dependent on {start time, end time}, depending on what the key of the raw log records is. If the raw data is coming from a viewer logging system, then I would expect the records to be naturally sorted on {start time, end time}, which would would be the key. That would make the construction of the R2 lookup table much easier. If you need to create the R2 lookup table in the DB, that should be fairly easy if the raw data is sorted conveniently. Since the lookup table is sorted, the counting can be done by reading the lookup table records sequentially and just incrementing a count. -- There is nothing wrong with me that could not be cured by a capful of Drano. H. S. Lahman hsl@pathfindermda.com Pathfinder Solutions http://www.pathfindermda.com blog: http://pathfinderpeople.blogs.com/hslahman "Model-Based Translation: The Next Step in Agile Development". Email info@pathfindermda.com for your copy. Pathfinder is hiring: http://www.pathfindermda.com/about_us/careers_pos3.php. (888)OOA-PATH |
|
#3
| |||
| |||
| "H. S. Lahman" news:BT3Ik.234$Rx2.69@nwrddc01.gnilink.net... > Responding to Yash... > >> We collect TV viewership data to provide details of ‘who watched what’ >> to advertising agencies. A typical report would show the distribution >> of the number of people who watched a given advertisement. The report >> will be further sliced into channels, categories of viewers, etc. >> We were thinking of OLAP as the solution as it keeps the data pre- >> processed. >> >> The problem we face is the following: >> User A watches from T0 to T20 >> User B watches from T0 to T10 >> User C watches from T10 to T20 >> If a single fact record represents 10 using of the time, >> A report for T0 to T10 will show 2 users >> A report for T10 to T20 shows 2 users >> A report for T0 to T20 would show an aggregated count of 2 + 2 = 4 >> which is wrong. This should ideally be 3 as 3 distinct users have >> watched between T0 and T20 >> Apart from this, the number of fact records created on a per 10 second >> timeframe would be very large. > > I think this is a basic design problem rather than a pattern problem. I > also suspect that optimization will be a bigger problem that simply > ensuring the viewers in the count are unique. > > [Advertisement] > + adID > | 1 > | > | R1 > | > | presented in > | * > [TimeFrame] > + startTime > + endTime > | * > | views in > | > R2 |--------- [LoginRecord] > | > | * > [Viewer] > + viewerID > > Basically the R2 association is managed with a sorted lookup table that > maps viewers to time slices. So to get the count of unique viewers for any > given advertisement, one just navigates the R1 -> R2 collections. > > Note that there are relatively few [TimeFrame] instances (8640 for 24 > hrs). R1 is instantiated from knowing when the ads play. The raw records > are only used to instantiate entries in the R2 lookup table. But that > table is, at worst, O(number of viewers x number of time frames) in size. > One can also play implementation games by making viewerID consecutive > integers and encoding the times as integer codes to reduce the lookup > table's physical size in memory. > > If there is a very large number of viewers (i.e., the log source is not a > sampling), then you might want to preprocess by creating the lookup table > temporarily in the DB from the raw data and then reading it in > sequentially to actually count the viewers per ad. > >> >> What would be an ideal solution to this problem? Is there a different >> way we should be creating fact data? Is there a different approach >> other than OLAP? >> The raw logs have the fields: >> UserID, advt ID, start time, end time. > > This seems strange. I would expect the raw data to be viewer logs without > any adID because you would probably know when each ad played from other > sources. IOW, adID may be a 3NF violation because it is dependent on > {start time, end time}, depending on what the key of the raw log records > is. > > If the raw data is coming from a viewer logging system, then I would > expect the records to be naturally sorted on {start time, end time}, which > would would be the key. That would make the construction of the R2 lookup > table much easier. If you need to create the R2 lookup table in the DB, > that should be fairly easy if the raw data is sorted conveniently. Since > the lookup table is sorted, the counting can be done by reading the lookup > table records sequentially and just incrementing a count. > > > -- > There is nothing wrong with me that could > not be cured by a capful of Drano. > > H. S. Lahman > hsl@pathfindermda.com > Pathfinder Solutions > http://www.pathfindermda.com > blog: http://pathfinderpeople.blogs.com/hslahman > "Model-Based Translation: The Next Step in Agile Development". Email > info@pathfindermda.com for your copy. > Pathfinder is hiring: > http://www.pathfindermda.com/about_us/careers_pos3.php. > (888)OOA-PATH My two cents... Do a web search on "overlapping time intervals". You'll find several solutions to the exact same problem. |
|
#4
| |||
| |||
|
I agree with the DB design proposed by H. S. Lahman. However, one thing to note is that the we do not want to query the relational DB to count the viewership of ads every time a report is submitted. Typical queries would be: Show the count of distinct people who watched XYZ ad between 10:00am and 11:am on 25th Sep Show the count of distinct people who watched XYZ ad between 10:00am and 10 m on 25th SepOLAP would keep the counts(measures) precomputed in a different structure(cube) and would not go to the DB for every request. The counts of individual minutes would add up to the count of the hour which futher adds up to the day. But the time overlap explained in the original post prevents us from using this technique as the aggregated counts woul be wrong. We would not get the distinct number of viewers. So the main challenge in the problem is how to pre-process the records to compute and aggregate the counts the way OLAP does. I am still looking for a this solution which may not involve OLAP but something similar. I could not find a solution on searching for "overlapping time intervals". The problem we are dealing with is not only to find overlapping time intervals but to aggregate the measures efficiently. Thanks, Yash Walter Mitty wrote: > "H. S. Lahman" > news:BT3Ik.234$Rx2.69@nwrddc01.gnilink.net... > > Responding to Yash... > > > >> We collect TV viewership data to provide details of �who watched what� > >> to advertising agencies. A typical report would show the distribution > >> of the number of people who watched a given advertisement. The report > >> will be further sliced into channels, categories of viewers, etc. > >> We were thinking of OLAP as the solution as it keeps the data pre- > >> processed. > >> > >> The problem we face is the following: > >> User A watches from T0 to T20 > >> User B watches from T0 to T10 > >> User C watches from T10 to T20 > >> If a single fact record represents 10 using of the time, > >> A report for T0 to T10 will show 2 users > >> A report for T10 to T20 shows 2 users > >> A report for T0 to T20 would show an aggregated count of 2 + 2 = 4 > >> which is wrong. This should ideally be 3 as 3 distinct users have > >> watched between T0 and T20 > >> Apart from this, the number of fact records created on a per 10 second > >> timeframe would be very large. > > > > I think this is a basic design problem rather than a pattern problem. I > > also suspect that optimization will be a bigger problem that simply > > ensuring the viewers in the count are unique. > > > > [Advertisement] > > + adID > > | 1 > > | > > | R1 > > | > > | presented in > > | * > > [TimeFrame] > > + startTime > > + endTime > > | * > > | views in > > | > > R2 |--------- [LoginRecord] > > | > > | * > > [Viewer] > > + viewerID > > > > Basically the R2 association is managed with a sorted lookup table that > > maps viewers to time slices. So to get the count of unique viewers for any > > given advertisement, one just navigates the R1 -> R2 collections. > > > > Note that there are relatively few [TimeFrame] instances (8640 for 24 > > hrs). R1 is instantiated from knowing when the ads play. The raw records > > are only used to instantiate entries in the R2 lookup table. But that > > table is, at worst, O(number of viewers x number of time frames) in size. > > One can also play implementation games by making viewerID consecutive > > integers and encoding the times as integer codes to reduce the lookup > > table's physical size in memory. > > > > If there is a very large number of viewers (i.e., the log source is nota > > sampling), then you might want to preprocess by creating the lookup table > > temporarily in the DB from the raw data and then reading it in > > sequentially to actually count the viewers per ad. > > > >> > >> What would be an ideal solution to this problem? Is there a different > >> way we should be creating fact data? Is there a different approach > >> other than OLAP? > >> The raw logs have the fields: > >> UserID, advt ID, start time, end time. > > > > This seems strange. I would expect the raw data to be viewer logs without > > any adID because you would probably know when each ad played from other > > sources. IOW, adID may be a 3NF violation because it is dependent on > > {start time, end time}, depending on what the key of the raw log records > > is. > > > > If the raw data is coming from a viewer logging system, then I would > > expect the records to be naturally sorted on {start time, end time}, which > > would would be the key. That would make the construction of the R2 lookup > > table much easier. If you need to create the R2 lookup table in the DB, > > that should be fairly easy if the raw data is sorted conveniently. Since > > the lookup table is sorted, the counting can be done by reading the lookup > > table records sequentially and just incrementing a count. > > > > > > -- > > There is nothing wrong with me that could > > not be cured by a capful of Drano. > > > > H. S. Lahman > > hsl@pathfindermda.com > > Pathfinder Solutions > > http://www.pathfindermda.com > > blog: http://pathfinderpeople.blogs.com/hslahman > > "Model-Based Translation: The Next Step in Agile Development". Email > > info@pathfindermda.com for your copy. > > Pathfinder is hiring: > > http://www.pathfindermda.com/about_us/careers_pos3.php. > > (888)OOA-PATH > > My two cents... > > Do a web search on "overlapping time intervals". You'll find several > solutions to the exact same problem. |
|
#5
| |||
| |||
|
Responding to Yash... > I agree with the DB design proposed by H. S. Lahman. However, one > thing to note is that the we do not want to query the relational DB to > count the viewership of ads every time a report is submitted. Typical > queries would be: > Show the count of distinct people who watched XYZ ad between 10:00am > and 11:am on 25th Sep > Show the count of distinct people who watched XYZ ad between 10:00am > and 10 m on 25th SepThere is nothing to prevent you from computing the counts once and storing them in the Timeframe objects. (Once the raw data is recorded for a time frame, the number of viewers won't change.) You are deliberately denormalizing the data but in this context that should be pretty benign precisely because the counts never change once they are computed. Then you can persist the TimeFrame objects separately and the individual requests just access those TimeFrame objects. Unless there is something else going on, that should not bog down the system very much since there are so few TimeFrame objects compared to raw data records. > > OLAP would keep the counts(measures) precomputed in a different > structure(cube) and would not go to the DB for every request. The > counts of individual minutes would add up to the count of the hour > which futher adds up to the day. Note that aggregation to larger time frame units is also a one-time computation and you can save those flavors of TimeFrame as well. Better yet, that aggregation is done from the existing TimeFrame objects rather than the raw data. > But the time overlap explained in the original post prevents us from > using this technique as the aggregated counts woul be wrong. We would > not get the distinct number of viewers. There is no free lunch. OLAP has to compute the counts so that solution will have to process the raw data, probably pretty much the same way as I suggested. IOW the queries will have to be more complicated. More to the point, OLAP has to save the counts somehow and that requires resources. In the end it is probably going to disk anyway with the OLAP server managing LRU access issues for the counts. -- There is nothing wrong with me that could not be cured by a capful of Drano. H. S. Lahman hsl@pathfindermda.com Pathfinder Solutions http://www.pathfindermda.com blog: http://pathfinderpeople.blogs.com/hslahman "Model-Based Translation: The Next Step in Agile Development". Email info@pathfindermda.com for your copy. Pathfinder is hiring: http://www.pathfindermda.com/about_us/careers_pos3.php. (888)OOA-PATH |
|
#6
| |||
| |||
|
It's certainly a factor. Some of these answers are almost comical in their irony. They epitomize the quotes you cite while denouncing them as false. Clearly you are thinking emotionally if you walk out your door, feel that it is cooler than normal, and make what you think is a "rational" decision to not accept what is happening on a global scale. Also people tend to focus only on what is tangible. This ties into my previous example. People focus almost solely on what is measurable and what they feel. Studies have been done on people where their brains were monitored for their reactions to money. |
![]() |
« Previous Thread
|
Next Thread »
| Thread Tools | |
| Display Modes | |
| |
All times are GMT -4. The time now is 04:27 PM.




m on 25th Sep
Linear Mode