Full text indexing recommendations - sqlserver-datamining
This is a discussion on Full text indexing recommendations - sqlserver-datamining ; On Fri, 26 Sep 2008 14:08:30 +0100, "ink" wrote: >I see what you mean. that does however mean that it could be a table with a >max row count of 12 million * the max length of the DisplayID. I ...
![]() |
| | LinkBack | Thread Tools | Display Modes |
|
#11
| |||
| |||
| >I see what you mean. that does however mean that it could be a table with a >max row count of 12 million * the max length of the DisplayID. I am not so >sure my clients or my manager will go for that. It sounds worse than it is, though it is still pretty bad. And since it would make no sense to include one or two (and possibly three) character strings it would be 12 million * (actual length - 2) rows. Even 120 million short rows can be manageable. If it was hidden in an index they might not think twice about it. You even might compare the size to the size of the full-text index you were trying, if it every finishes. Roy Harvey Beacon Falls, CT |
|
#12
| |||
| |||
|
Thanks Daniel, I didnt know that i couldnt use the wild card in the front. That means i cant us Full text indexing even if i wanted to. O well. I really think that i am just going to have to up the connection time out for this query. ink "Daniel Crichton" news:uWptgi9HJHA.1756@TK2MSFTNGP04.phx.gbl... > ink wrote on Fri, 26 Sep 2008 11:05:13 +0100: > > >> I have an items table that has 14 Million records in it and I have a >> requirement to allow the user to do a search of any part of a barcode >> data column, in this case it is called DisplayedItemID . > >> In most cases the user will know at least 4 char. > >> So when I run the following query: >> Select InternalConsignmentID, DisplayedItemID From ITEMS WHERE >> DisplayedItemID LIKE '%6332%' > >> It takes about 50 sec and returns 12700 rows. >> This is a simple example of what I am trying to achieve. > >> What I am trying to do is compare this search to that of a full text >> index search. So I have create a full text index on the Items table >> and I started the population of the index last night and it is still >> going like 16 hours later. And when I run the following query I get no >> results at all. >> Select InternalConsignmentID, DisplayedItemID From ITEMS WHERE >> CONTAINS(DisplayedItemID, N'*6332*') > > The initial catalog build can take a long time with very large tables as > the FTS system is outside of SQL Server. > > This won't work. You can only use the * as a suffix, eg. > > 6332* will find all indexed "words" starting with the string 6332 > > You cannot use a * at the start of a "word". > >> This to me means that it is still populating and it has not even >> reached the length of 4 char string yet. It is defiantly still >> populating the index. > >> Here are a few of the example barcodes that are being held as data. >> 62459900/01 62457073 34597695 08633218 3563203959010 >> L356326036401H 3563321163016 1633214611015 > > The first will likely be split into the words 62459900 and 01 (although > the noise word file might discard the 01 if it treats it as just 1). > >> Is full text indexing the right thing for this kind of data? > > Definitely not. > >> Is it really going to help? > > No. > >> Is this kind of data not what Full text indexing was meant for? > > For performing searches on whole words. For instance, I have a number of > web sites with almost 1 million books listed on them. FTS enables my > search to quickly find matches for words entered into searches, such as > "windows", which would potentially take a lot longer doing scans with LIKE > '% windows %' (assuming that words have spaces around them, but the FTS > indexer will also break on other characters, depending on the language > settings, such as punctuation - doing this with LIKE would sometimes mean > looking for many different combinations of the word being searched and > surrounding punctuation/whitespace). The FTS search can very quickly find > the "windows" word entry in it's index and pass back the primary key > values for the matching rows to SQL Server to then pull up the actual row > data from the table. > > -- > Dan > |
|
#13
| |||
| |||
| As i thought, its a no. Fantastic Idea though. I really think the solution is going to have to be just upping the Query time out limit. thanks, ink "ink" news:OhIe7j9HJHA.4232@TK2MSFTNGP03.phx.gbl... > > I see what you mean. that does however mean that it could be a table with > a max row count of 12 million * the max length of the DisplayID. I am not > so sure my clients or my manager will go for that. > > > > > > "Roy Harvey (SQL Server MVP)" > news:3hmpd4dlgpg6j3k9bu8gd96bv0hlkdvec4@4ax.com... >> On Fri, 26 Sep 2008 06:34:40 -0500, "Dan Guzman" >> >> >>>> Is full text indexing the right thing for this kind of data? >>> >>>I don't think a full text index is the right tool for the job. I think a >>>better approach is create individual computed columns (perhaps persisted) >>>with a SUBSTRING of DisplayedItemID. You can then index those columns >>>individually for maximum performance and will get more accurate results >>>since only the desired portion(s) of the bar code are searched. >> >> I was thinking of creating another table with two columns. The first >> would be UPC_segment, the other would be the PK of the table being >> queried. For each UPC value there would be several rows; one with the >> whole UPC, a second starting with the second character, the third >> starting with the third character,etc. on down to whatever string is >> considered the reasonable minimum - 3 or 4 by my guess. The table >> would have a clustered index on both columns with the UPC segment >> coming first. The LIKE search would be LIKE '6332%' with only a >> trailing %, and of course it would have to be joined back to the >> original table. >> >> The longest UPC in the sample data shows was 14 characters, which >> would mean at least 11 rows in the table for that one. The shortest >> was 8 characters, so five or six rows for that, depending on whether >> the shortest used is the last 3 or 4 characters. >> >> Roy Harvey >> Beacon Falls, CT > |
|
#14
| |||
| |||
| Yeah your right i dint think it ever will end. But at least i learned something. first time i have ever even looked at full text indexing. "Roy Harvey (SQL Server MVP)" news:tdopd4pbr0h11b30e8nd9hnb4t4qimc5r9@4ax.com... > On Fri, 26 Sep 2008 14:08:30 +0100, "ink" > >>I see what you mean. that does however mean that it could be a table with >>a >>max row count of 12 million * the max length of the DisplayID. I am not so >>sure my clients or my manager will go for that. > > It sounds worse than it is, though it is still pretty bad. And since > it would make no sense to include one or two (and possibly three) > character strings it would be 12 million * (actual length - 2) rows. > Even 120 million short rows can be manageable. If it was hidden in an > index they might not think twice about it. You even might compare the > size to the size of the full-text index you were trying, if it every > finishes. > > Roy Harvey > Beacon Falls, CT |
|
#15
| |||
| |||
|
Check the SQL Server help files, it's always worth reading up there before implementing something you're not sure of. You'll find that it clearly states that you can only ever use * at the end of words, and all the examples using the wildcard reinforce this. Dan ink wrote on Fri, 26 Sep 2008 14:25:52 +0100: > Thanks Daniel, > I didnt know that i couldnt use the wild card in the front. > That means i cant us Full text indexing even if i wanted to. O well. > I really think that i am just going to have to up the connection time > out for this query. > ink > "Daniel Crichton" >> ink wrote on Fri, 26 Sep 2008 11:05:13 +0100: >>> I have an items table that has 14 Million records in it and I have a >>> requirement to allow the user to do a search of any part of a >>> barcode data column, in this case it is called DisplayedItemID . >>> In most cases the user will know at least 4 char. >>> So when I run the following query: >>> Select InternalConsignmentID, DisplayedItemID From ITEMS WHERE >>> DisplayedItemID LIKE '%6332%' >>> It takes about 50 sec and returns 12700 rows. >>> This is a simple example of what I am trying to achieve. >>> What I am trying to do is compare this search to that of a full text >>> index search. So I have create a full text index on the Items table >>> and I started the population of the index last night and it is >>> still going like 16 hours later. And when I run the following query >>> I get no results at all. >>> Select InternalConsignmentID, DisplayedItemID From ITEMS WHERE >>> CONTAINS(DisplayedItemID, N'*6332*') >> The initial catalog build can take a long time with very large tables >> as the FTS system is outside of SQL Server. >> This won't work. You can only use the * as a suffix, eg. >> 6332* will find all indexed "words" starting with the string 6332 >> You cannot use a * at the start of a "word". >>> This to me means that it is still populating and it has not even >>> reached the length of 4 char string yet. It is defiantly still >>> populating the index. >>> Here are a few of the example barcodes that are being held as data. >>> 62459900/01 62457073 34597695 08633218 3563203959010 >>> L356326036401H 3563321163016 1633214611015 >> The first will likely be split into the words 62459900 and 01 >> (although the noise word file might discard the 01 if it treats it >> as just 1). >>> Is full text indexing the right thing for this kind of data? >> Definitely not. >>> Is it really going to help? >> No. >>> Is this kind of data not what Full text indexing was meant for? >> For performing searches on whole words. For instance, I have a number >> of web sites with almost 1 million books listed on them. FTS enables >> my search to quickly find matches for words entered into searches, >> such as "windows", which would potentially take a lot longer doing >> scans with LIKE '% windows %' (assuming that words have spaces >> around them, but the FTS indexer will also break on other >> characters, depending on the language settings, such as punctuation >> - doing this with LIKE would sometimes mean looking for many >> different combinations of the word being searched and surrounding >> punctuation/whitespace). The FTS search can very quickly find the >> "windows" word entry in it's index and pass back the primary key >> values for the matching rows to SQL Server to then pull up the actual >> row data from the table. >> -- >> Dan |
|
#16
| |||
| |||
|
Full text indexing is not the right solution here. If you could limit the users to, say, the last 4 digits of the barcode or the first 4 digits of the barcode, or something to that effect in their searches you could separate that out and take advantage of indexing strategies. Alternatively you could create some sort of n-gram strategy that would take advantage of indexes, although your storage requirements would go up quite a bit. "ink" news:Otmsh97HJHA.3736@TK2MSFTNGP06.phx.gbl... > > Hi All, > > I am not quite sure if this is the correct place for this. Sorry if it is > not. > > I am using SQL Server 2005. > > I have an items table that has 14 Million records in it and I have a > requirement to allow the user to do a search of any part of a barcode data > column, in this case it is called DisplayedItemID . > > In most cases the user will know at least 4 char. > > So when I run the following query: > Select InternalConsignmentID, DisplayedItemID From ITEMS WHERE > DisplayedItemID LIKE '%6332%' > > It takes about 50 sec and returns 12700 rows. > This is a simple example of what I am trying to achieve. > > What I am trying to do is compare this search to that of a full text index > search. So I have create a full text index on the Items table and I > started the population of the index last night and it is still going like > 16 hours later. And when I run the following query I get no results at > all. > Select InternalConsignmentID, DisplayedItemID From ITEMS WHERE > CONTAINS(DisplayedItemID, N'*6332*') > This to me means that it is still populating and it has not even reached > the length of 4 char string yet. It is defiantly still populating the > index. > > Here are a few of the example barcodes that are being held as data. > 62459900/01 > 62457073 > 34597695 > 08633218 > 3563203959010 > L356326036401H > 3563321163016 > 1633214611015 > > Is full text indexing the right thing for this kind of data? > Is it really going to help? > Is there a way of speeding it up? > Is this kind of data not what Full text indexing was meant for? > > Should I rather just increase the time out of the connection and the use > LIKE '%6332%' > > Thanks > ink > > > > |
![]() |
« Previous Thread
|
Next Thread »
| Thread Tools | |
| Display Modes | |
| |
All times are GMT -4. The time now is 03:45 PM.




Linear Mode