Full text indexing recommendations - sqlserver-datamining
This is a discussion on Full text indexing recommendations - sqlserver-datamining ; Hi All, I am not quite sure if this is the correct place for this. Sorry if it is not. I am using SQL Server 2005. I have an items table that has 14 Million records in it and I ...
![]() |
| | LinkBack | Thread Tools | Display Modes |
|
#1
| |||
| |||
| Hi All, I am not quite sure if this is the correct place for this. Sorry if it is not. I am using SQL Server 2005. I have an items table that has 14 Million records in it and I have a requirement to allow the user to do a search of any part of a barcode data column, in this case it is called DisplayedItemID . In most cases the user will know at least 4 char. So when I run the following query: Select InternalConsignmentID, DisplayedItemID From ITEMS WHERE DisplayedItemID LIKE '%6332%' It takes about 50 sec and returns 12700 rows. This is a simple example of what I am trying to achieve. What I am trying to do is compare this search to that of a full text index search. So I have create a full text index on the Items table and I started the population of the index last night and it is still going like 16 hours later. And when I run the following query I get no results at all. Select InternalConsignmentID, DisplayedItemID From ITEMS WHERE CONTAINS(DisplayedItemID, N'*6332*') This to me means that it is still populating and it has not even reached the length of 4 char string yet. It is defiantly still populating the index. Here are a few of the example barcodes that are being held as data. 62459900/01 62457073 34597695 08633218 3563203959010 L356326036401H 3563321163016 1633214611015 Is full text indexing the right thing for this kind of data? Is it really going to help? Is there a way of speeding it up? Is this kind of data not what Full text indexing was meant for? Should I rather just increase the time out of the connection and the use LIKE '%6332%' Thanks ink |
|
#2
| |||
| |||
|
> Is full text indexing the right thing for this kind of data? I don't think a full text index is the right tool for the job. I think a better approach is create individual computed columns (perhaps persisted) with a SUBSTRING of DisplayedItemID. You can then index those columns individually for maximum performance and will get more accurate results since only the desired portion(s) of the bar code are searched. -- Hope this helps. Dan Guzman SQL Server MVP http://weblogs.sqlteam.com/dang/ "ink" news:Otmsh97HJHA.3736@TK2MSFTNGP06.phx.gbl... > > Hi All, > > I am not quite sure if this is the correct place for this. Sorry if it is > not. > > I am using SQL Server 2005. > > I have an items table that has 14 Million records in it and I have a > requirement to allow the user to do a search of any part of a barcode data > column, in this case it is called DisplayedItemID . > > In most cases the user will know at least 4 char. > > So when I run the following query: > Select InternalConsignmentID, DisplayedItemID From ITEMS WHERE > DisplayedItemID LIKE '%6332%' > > It takes about 50 sec and returns 12700 rows. > This is a simple example of what I am trying to achieve. > > What I am trying to do is compare this search to that of a full text index > search. So I have create a full text index on the Items table and I > started the population of the index last night and it is still going like > 16 hours later. And when I run the following query I get no results at > all. > Select InternalConsignmentID, DisplayedItemID From ITEMS WHERE > CONTAINS(DisplayedItemID, N'*6332*') > This to me means that it is still populating and it has not even reached > the length of 4 char string yet. It is defiantly still populating the > index. > > Here are a few of the example barcodes that are being held as data. > 62459900/01 > 62457073 > 34597695 > 08633218 > 3563203959010 > L356326036401H > 3563321163016 > 1633214611015 > > Is full text indexing the right thing for this kind of data? > Is it really going to help? > Is there a way of speeding it up? > Is this kind of data not what Full text indexing was meant for? > > Should I rather just increase the time out of the connection and the use > LIKE '%6332%' > > Thanks > ink > > > > |
|
#3
| |||
| |||
|
Full-text is built around words, and language conventions. I do not think it will help. Obviously the LIKE '%6332%' query does a scan now. Does it scan the entire table, or does it scan an index on DisplayedItemID? Is there a (non-clustered) index on DisplayedItemID? Roy Harvey Beacon Falls, CT On Fri, 26 Sep 2008 11:05:13 +0100, "ink" > >Hi All, > >I am not quite sure if this is the correct place for this. Sorry if it is >not. > >I am using SQL Server 2005. > >I have an items table that has 14 Million records in it and I have a >requirement to allow the user to do a search of any part of a barcode data >column, in this case it is called DisplayedItemID . > >In most cases the user will know at least 4 char. > >So when I run the following query: >Select InternalConsignmentID, DisplayedItemID From ITEMS WHERE >DisplayedItemID LIKE '%6332%' > >It takes about 50 sec and returns 12700 rows. >This is a simple example of what I am trying to achieve. > >What I am trying to do is compare this search to that of a full text index >search. So I have create a full text index on the Items table and I started >the population of the index last night and it is still going like 16 hours >later. And when I run the following query I get no results at all. >Select InternalConsignmentID, DisplayedItemID From ITEMS WHERE >CONTAINS(DisplayedItemID, N'*6332*') >This to me means that it is still populating and it has not even reached the >length of 4 char string yet. It is defiantly still populating the index. > >Here are a few of the example barcodes that are being held as data. >62459900/01 >62457073 >34597695 >08633218 >3563203959010 >L356326036401H >3563321163016 >1633214611015 > >Is full text indexing the right thing for this kind of data? >Is it really going to help? >Is there a way of speeding it up? >Is this kind of data not what Full text indexing was meant for? > >Should I rather just increase the time out of the connection and the use >LIKE '%6332%' > >Thanks >ink > > > |
|
#4
| |||
| |||
|
Hi Roy, Thanks for getting back, Yes there is an index on DisplayedItemID. However indexes are always ignored when you have a wild card as the first Char in the surch string. And yes it is doing a full table scan. Thanks, ink "Roy Harvey (SQL Server MVP)" news:r6ipd4dlh1j5m60sam8gvu13v8oj3o2uqq@4ax.com... > Full-text is built around words, and language conventions. I do not > think it will help. > > Obviously the LIKE '%6332%' query does a scan now. Does it scan the > entire table, or does it scan an index on DisplayedItemID? Is there a > (non-clustered) index on DisplayedItemID? > > Roy Harvey > Beacon Falls, CT > > On Fri, 26 Sep 2008 11:05:13 +0100, "ink" > >> >>Hi All, >> >>I am not quite sure if this is the correct place for this. Sorry if it is >>not. >> >>I am using SQL Server 2005. >> >>I have an items table that has 14 Million records in it and I have a >>requirement to allow the user to do a search of any part of a barcode data >>column, in this case it is called DisplayedItemID . >> >>In most cases the user will know at least 4 char. >> >>So when I run the following query: >>Select InternalConsignmentID, DisplayedItemID From ITEMS WHERE >>DisplayedItemID LIKE '%6332%' >> >>It takes about 50 sec and returns 12700 rows. >>This is a simple example of what I am trying to achieve. >> >>What I am trying to do is compare this search to that of a full text index >>search. So I have create a full text index on the Items table and I >>started >>the population of the index last night and it is still going like 16 hours >>later. And when I run the following query I get no results at all. >>Select InternalConsignmentID, DisplayedItemID From ITEMS WHERE >>CONTAINS(DisplayedItemID, N'*6332*') >>This to me means that it is still populating and it has not even reached >>the >>length of 4 char string yet. It is defiantly still populating the index. >> >>Here are a few of the example barcodes that are being held as data. >>62459900/01 >>62457073 >>34597695 >>08633218 >>3563203959010 >>L356326036401H >>3563321163016 >>1633214611015 >> >>Is full text indexing the right thing for this kind of data? >>Is it really going to help? >>Is there a way of speeding it up? >>Is this kind of data not what Full text indexing was meant for? >> >>Should I rather just increase the time out of the connection and the use >>LIKE '%6332%' >> >>Thanks >>ink >> >> >> |
|
#5
| |||
| |||
|
On Fri, 26 Sep 2008 06:34:40 -0500, "Dan Guzman" >> Is full text indexing the right thing for this kind of data? > >I don't think a full text index is the right tool for the job. I think a >better approach is create individual computed columns (perhaps persisted) >with a SUBSTRING of DisplayedItemID. You can then index those columns >individually for maximum performance and will get more accurate results >since only the desired portion(s) of the bar code are searched. I was thinking of creating another table with two columns. The first would be UPC_segment, the other would be the PK of the table being queried. For each UPC value there would be several rows; one with the whole UPC, a second starting with the second character, the third starting with the third character,etc. on down to whatever string is considered the reasonable minimum - 3 or 4 by my guess. The table would have a clustered index on both columns with the UPC segment coming first. The LIKE search would be LIKE '6332%' with only a trailing %, and of course it would have to be joined back to the original table. The longest UPC in the sample data shows was 14 characters, which would mean at least 11 rows in the table for that one. The shortest was 8 characters, so five or six rows for that, depending on whether the shortest used is the last 3 or 4 characters. Roy Harvey Beacon Falls, CT |
|
#6
| |||
| |||
|
> Yes there is an index on DisplayedItemID. However indexes are always > ignored when you have a wild card as the first Char in the surch string. > And yes it is doing a full table scan. I understand the implications of the leading wildcard search. I an suggesting that you break the DisplayedItemID bar code value into its atomic values rather than requiring the LIKE against the entire value. For example, ProductCode AS SUBSTRING(DisplayedItemID, 1, 6) SubItem AS SUBSTRING(DisplayedItemID, 7, 3) Users can then query the atomic values directly and you can persist/index as desired. This will avoid the table scan and performance will be much better, not to mention the benefits of a normalized database design. -- Hope this helps. Dan Guzman SQL Server MVP http://weblogs.sqlteam.com/dang/ "ink" news:OIJjSU9HJHA.1160@TK2MSFTNGP05.phx.gbl... > Hi Roy, > > Thanks for getting back, > > Yes there is an index on DisplayedItemID. However indexes are always > ignored when you have a wild card as the first Char in the surch string. > And yes it is doing a full table scan. > > Thanks, > ink > > > > > "Roy Harvey (SQL Server MVP)" > news:r6ipd4dlh1j5m60sam8gvu13v8oj3o2uqq@4ax.com... >> Full-text is built around words, and language conventions. I do not >> think it will help. >> >> Obviously the LIKE '%6332%' query does a scan now. Does it scan the >> entire table, or does it scan an index on DisplayedItemID? Is there a >> (non-clustered) index on DisplayedItemID? >> >> Roy Harvey >> Beacon Falls, CT >> >> On Fri, 26 Sep 2008 11:05:13 +0100, "ink" >> >>> >>>Hi All, >>> >>>I am not quite sure if this is the correct place for this. Sorry if it is >>>not. >>> >>>I am using SQL Server 2005. >>> >>>I have an items table that has 14 Million records in it and I have a >>>requirement to allow the user to do a search of any part of a barcode >>>data >>>column, in this case it is called DisplayedItemID . >>> >>>In most cases the user will know at least 4 char. >>> >>>So when I run the following query: >>>Select InternalConsignmentID, DisplayedItemID From ITEMS WHERE >>>DisplayedItemID LIKE '%6332%' >>> >>>It takes about 50 sec and returns 12700 rows. >>>This is a simple example of what I am trying to achieve. >>> >>>What I am trying to do is compare this search to that of a full text >>>index >>>search. So I have create a full text index on the Items table and I >>>started >>>the population of the index last night and it is still going like 16 >>>hours >>>later. And when I run the following query I get no results at all. >>>Select InternalConsignmentID, DisplayedItemID From ITEMS WHERE >>>CONTAINS(DisplayedItemID, N'*6332*') >>>This to me means that it is still populating and it has not even reached >>>the >>>length of 4 char string yet. It is defiantly still populating the index. >>> >>>Here are a few of the example barcodes that are being held as data. >>>62459900/01 >>>62457073 >>>34597695 >>>08633218 >>>3563203959010 >>>L356326036401H >>>3563321163016 >>>1633214611015 >>> >>>Is full text indexing the right thing for this kind of data? >>>Is it really going to help? >>>Is there a way of speeding it up? >>>Is this kind of data not what Full text indexing was meant for? >>> >>>Should I rather just increase the time out of the connection and the use >>>LIKE '%6332%' >>> >>>Thanks >>>ink >>> >>> >>> > |
|
#7
| |||
| |||
|
Hi Dan, Thanks for getting back. As the data is all different lengths and i cant be sure what part they are searching for. I am not sure how i would implement this. I have done a simple test using the following queries. Took: 41 sec Select InternalConsignmentID, DisplayedItemID From ITEMS WHERE SUBSTRING(DisplayedItemID,2,4) ='6332' Took: 51 Sec Select InternalConsignmentID, DisplayedItemID From ITEMS WHERE DisplayedItemID LIKE '__6332%' Took: 44 Sec Select InternalConsignmentID, DisplayedItemID From ITEMS WHERE DisplayedItemID LIKE '%6332%' I need to try and get it down to under 20 Sec and as you can see there really wasnt much difference using SUBSTRING. If i was to create and indexed view of the data then i know it would be much faster but i have also lost the capability of searching for any part of the string. is the above sort of what you were thinking about? thanks, ink "Dan Guzman" news:OadGev8HJHA.4564@TK2MSFTNGP02.phx.gbl... >> Is full text indexing the right thing for this kind of data? > > I don't think a full text index is the right tool for the job. I think a > better approach is create individual computed columns (perhaps persisted) > with a SUBSTRING of DisplayedItemID. You can then index those columns > individually for maximum performance and will get more accurate results > since only the desired portion(s) of the bar code are searched. > > -- > Hope this helps. > > Dan Guzman > SQL Server MVP > http://weblogs.sqlteam.com/dang/ > > "ink" > news:Otmsh97HJHA.3736@TK2MSFTNGP06.phx.gbl... >> >> Hi All, >> >> I am not quite sure if this is the correct place for this. Sorry if it is >> not. >> >> I am using SQL Server 2005. >> >> I have an items table that has 14 Million records in it and I have a >> requirement to allow the user to do a search of any part of a barcode >> data column, in this case it is called DisplayedItemID . >> >> In most cases the user will know at least 4 char. >> >> So when I run the following query: >> Select InternalConsignmentID, DisplayedItemID From ITEMS WHERE >> DisplayedItemID LIKE '%6332%' >> >> It takes about 50 sec and returns 12700 rows. >> This is a simple example of what I am trying to achieve. >> >> What I am trying to do is compare this search to that of a full text >> index search. So I have create a full text index on the Items table and I >> started the population of the index last night and it is still going like >> 16 hours later. And when I run the following query I get no results at >> all. >> Select InternalConsignmentID, DisplayedItemID From ITEMS WHERE >> CONTAINS(DisplayedItemID, N'*6332*') >> This to me means that it is still populating and it has not even reached >> the length of 4 char string yet. It is defiantly still populating the >> index. >> >> Here are a few of the example barcodes that are being held as data. >> 62459900/01 >> 62457073 >> 34597695 >> 08633218 >> 3563203959010 >> L356326036401H >> 3563321163016 >> 1633214611015 >> >> Is full text indexing the right thing for this kind of data? >> Is it really going to help? >> Is there a way of speeding it up? >> Is this kind of data not what Full text indexing was meant for? >> >> Should I rather just increase the time out of the connection and the use >> LIKE '%6332%' >> >> Thanks >> ink >> >> >> >> > |
|
#8
| |||
| |||
|
On Fri, 26 Sep 2008 13:40:31 +0100, "ink" >Yes there is an index on DisplayedItemID. However indexes are always ignored >when you have a wild card as the first Char in the surch string. And yes it >is doing a full table scan. Sometimes it is possible to get an index scan, even with a leading % when the optimizer thinks the reduced I/O of scanning the smaller index will be worth it. A table with wide rows would be more likely to go that way. But I am not surprised you do not in this case. Roy Harvey Beacon Falls, CT |
|
#9
| |||
| |||
|
ink wrote on Fri, 26 Sep 2008 11:05:13 +0100: > I have an items table that has 14 Million records in it and I have a > requirement to allow the user to do a search of any part of a barcode > data column, in this case it is called DisplayedItemID . > In most cases the user will know at least 4 char. > So when I run the following query: > Select InternalConsignmentID, DisplayedItemID From ITEMS WHERE > DisplayedItemID LIKE '%6332%' > It takes about 50 sec and returns 12700 rows. > This is a simple example of what I am trying to achieve. > What I am trying to do is compare this search to that of a full text > index search. So I have create a full text index on the Items table > and I started the population of the index last night and it is still > going like 16 hours later. And when I run the following query I get no > results at all. > Select InternalConsignmentID, DisplayedItemID From ITEMS WHERE > CONTAINS(DisplayedItemID, N'*6332*') The initial catalog build can take a long time with very large tables as the FTS system is outside of SQL Server. This won't work. You can only use the * as a suffix, eg. 6332* will find all indexed "words" starting with the string 6332 You cannot use a * at the start of a "word". > This to me means that it is still populating and it has not even > reached the length of 4 char string yet. It is defiantly still > populating the index. > Here are a few of the example barcodes that are being held as data. > 62459900/01 62457073 34597695 08633218 3563203959010 > L356326036401H 3563321163016 1633214611015 The first will likely be split into the words 62459900 and 01 (although the noise word file might discard the 01 if it treats it as just 1). > Is full text indexing the right thing for this kind of data? Definitely not. > Is it really going to help? No. > Is this kind of data not what Full text indexing was meant for? For performing searches on whole words. For instance, I have a number of web sites with almost 1 million books listed on them. FTS enables my search to quickly find matches for words entered into searches, such as "windows", which would potentially take a lot longer doing scans with LIKE '% windows %' (assuming that words have spaces around them, but the FTS indexer will also break on other characters, depending on the language settings, such as punctuation - doing this with LIKE would sometimes mean looking for many different combinations of the word being searched and surrounding punctuation/whitespace). The FTS search can very quickly find the "windows" word entry in it's index and pass back the primary key values for the matching rows to SQL Server to then pull up the actual row data from the table. -- Dan |
|
#10
| |||
| |||
| I see what you mean. that does however mean that it could be a table with a max row count of 12 million * the max length of the DisplayID. I am not so sure my clients or my manager will go for that. "Roy Harvey (SQL Server MVP)" news:3hmpd4dlgpg6j3k9bu8gd96bv0hlkdvec4@4ax.com... > On Fri, 26 Sep 2008 06:34:40 -0500, "Dan Guzman" > > >>> Is full text indexing the right thing for this kind of data? >> >>I don't think a full text index is the right tool for the job. I think a >>better approach is create individual computed columns (perhaps persisted) >>with a SUBSTRING of DisplayedItemID. You can then index those columns >>individually for maximum performance and will get more accurate results >>since only the desired portion(s) of the bar code are searched. > > I was thinking of creating another table with two columns. The first > would be UPC_segment, the other would be the PK of the table being > queried. For each UPC value there would be several rows; one with the > whole UPC, a second starting with the second character, the third > starting with the third character,etc. on down to whatever string is > considered the reasonable minimum - 3 or 4 by my guess. The table > would have a clustered index on both columns with the UPC segment > coming first. The LIKE search would be LIKE '6332%' with only a > trailing %, and of course it would have to be joined back to the > original table. > > The longest UPC in the sample data shows was 14 characters, which > would mean at least 11 rows in the table for that one. The shortest > was 8 characters, so five or six rows for that, depending on whether > the shortest used is the last 3 or 4 characters. > > Roy Harvey > Beacon Falls, CT |
![]() |
« Previous Thread
|
Next Thread »
| Thread Tools | |
| Display Modes | |
| |
All times are GMT -4. The time now is 04:20 PM.




Linear Mode