dbaspot
Tags Register FAQ Calendar Search Today's Posts Mark Forums Read

Full text indexing recommendations - sqlserver-datamining

This is a discussion on Full text indexing recommendations - sqlserver-datamining ; Hi All, I am not quite sure if this is the correct place for this. Sorry if it is not. I am using SQL Server 2005. I have an items table that has 14 Million records in it and I ...


Home > Database Forum > Data Warehousing > sqlserver-datamining > Full text indexing recommendations

Reply

 

LinkBack Thread Tools Display Modes
  #1  
Old 09-26-2008, 06:05 AM
Database Bot
 
Join Date: Sep 2009
Posts: 1,236,254
Database Administrator is on a distinguished road
Default Full text indexing recommendations


Hi All,

I am not quite sure if this is the correct place for this. Sorry if it is
not.

I am using SQL Server 2005.

I have an items table that has 14 Million records in it and I have a
requirement to allow the user to do a search of any part of a barcode data
column, in this case it is called DisplayedItemID .

In most cases the user will know at least 4 char.

So when I run the following query:
Select InternalConsignmentID, DisplayedItemID From ITEMS WHERE
DisplayedItemID LIKE '%6332%'

It takes about 50 sec and returns 12700 rows.
This is a simple example of what I am trying to achieve.

What I am trying to do is compare this search to that of a full text index
search. So I have create a full text index on the Items table and I started
the population of the index last night and it is still going like 16 hours
later. And when I run the following query I get no results at all.
Select InternalConsignmentID, DisplayedItemID From ITEMS WHERE
CONTAINS(DisplayedItemID, N'*6332*')
This to me means that it is still populating and it has not even reached the
length of 4 char string yet. It is defiantly still populating the index.

Here are a few of the example barcodes that are being held as data.
62459900/01
62457073
34597695
08633218
3563203959010
L356326036401H
3563321163016
1633214611015

Is full text indexing the right thing for this kind of data?
Is it really going to help?
Is there a way of speeding it up?
Is this kind of data not what Full text indexing was meant for?

Should I rather just increase the time out of the connection and the use
LIKE '%6332%'

Thanks
ink




Reply With Quote
  #2  
Old 09-26-2008, 07:34 AM
Database Bot
 
Join Date: Sep 2009
Posts: 1,236,254
Database Administrator is on a distinguished road
Default Re: Full text indexing recommendations

> Is full text indexing the right thing for this kind of data?

I don't think a full text index is the right tool for the job. I think a
better approach is create individual computed columns (perhaps persisted)
with a SUBSTRING of DisplayedItemID. You can then index those columns
individually for maximum performance and will get more accurate results
since only the desired portion(s) of the bar code are searched.

--
Hope this helps.

Dan Guzman
SQL Server MVP
http://weblogs.sqlteam.com/dang/

"ink" wrote in message
news:Otmsh97HJHA.3736@TK2MSFTNGP06.phx.gbl...
>
> Hi All,
>
> I am not quite sure if this is the correct place for this. Sorry if it is
> not.
>
> I am using SQL Server 2005.
>
> I have an items table that has 14 Million records in it and I have a
> requirement to allow the user to do a search of any part of a barcode data
> column, in this case it is called DisplayedItemID .
>
> In most cases the user will know at least 4 char.
>
> So when I run the following query:
> Select InternalConsignmentID, DisplayedItemID From ITEMS WHERE
> DisplayedItemID LIKE '%6332%'
>
> It takes about 50 sec and returns 12700 rows.
> This is a simple example of what I am trying to achieve.
>
> What I am trying to do is compare this search to that of a full text index
> search. So I have create a full text index on the Items table and I
> started the population of the index last night and it is still going like
> 16 hours later. And when I run the following query I get no results at
> all.
> Select InternalConsignmentID, DisplayedItemID From ITEMS WHERE
> CONTAINS(DisplayedItemID, N'*6332*')
> This to me means that it is still populating and it has not even reached
> the length of 4 char string yet. It is defiantly still populating the
> index.
>
> Here are a few of the example barcodes that are being held as data.
> 62459900/01
> 62457073
> 34597695
> 08633218
> 3563203959010
> L356326036401H
> 3563321163016
> 1633214611015
>
> Is full text indexing the right thing for this kind of data?
> Is it really going to help?
> Is there a way of speeding it up?
> Is this kind of data not what Full text indexing was meant for?
>
> Should I rather just increase the time out of the connection and the use
> LIKE '%6332%'
>
> Thanks
> ink
>
>
>
>


Reply With Quote
  #3  
Old 09-26-2008, 07:38 AM
Database Bot
 
Join Date: Sep 2009
Posts: 1,236,254
Database Administrator is on a distinguished road
Default Re: Full text indexing recommendations

Full-text is built around words, and language conventions. I do not
think it will help.

Obviously the LIKE '%6332%' query does a scan now. Does it scan the
entire table, or does it scan an index on DisplayedItemID? Is there a
(non-clustered) index on DisplayedItemID?

Roy Harvey
Beacon Falls, CT

On Fri, 26 Sep 2008 11:05:13 +0100, "ink" wrote:

>
>Hi All,
>
>I am not quite sure if this is the correct place for this. Sorry if it is
>not.
>
>I am using SQL Server 2005.
>
>I have an items table that has 14 Million records in it and I have a
>requirement to allow the user to do a search of any part of a barcode data
>column, in this case it is called DisplayedItemID .
>
>In most cases the user will know at least 4 char.
>
>So when I run the following query:
>Select InternalConsignmentID, DisplayedItemID From ITEMS WHERE
>DisplayedItemID LIKE '%6332%'
>
>It takes about 50 sec and returns 12700 rows.
>This is a simple example of what I am trying to achieve.
>
>What I am trying to do is compare this search to that of a full text index
>search. So I have create a full text index on the Items table and I started
>the population of the index last night and it is still going like 16 hours
>later. And when I run the following query I get no results at all.
>Select InternalConsignmentID, DisplayedItemID From ITEMS WHERE
>CONTAINS(DisplayedItemID, N'*6332*')
>This to me means that it is still populating and it has not even reached the
>length of 4 char string yet. It is defiantly still populating the index.
>
>Here are a few of the example barcodes that are being held as data.
>62459900/01
>62457073
>34597695
>08633218
>3563203959010
>L356326036401H
>3563321163016
>1633214611015
>
>Is full text indexing the right thing for this kind of data?
>Is it really going to help?
>Is there a way of speeding it up?
>Is this kind of data not what Full text indexing was meant for?
>
>Should I rather just increase the time out of the connection and the use
>LIKE '%6332%'
>
>Thanks
>ink
>
>
>

Reply With Quote
  #4  
Old 09-26-2008, 08:40 AM
Database Bot
 
Join Date: Sep 2009
Posts: 1,236,254
Database Administrator is on a distinguished road
Default Re: Full text indexing recommendations

Hi Roy,

Thanks for getting back,

Yes there is an index on DisplayedItemID. However indexes are always ignored
when you have a wild card as the first Char in the surch string. And yes it
is doing a full table scan.

Thanks,
ink




"Roy Harvey (SQL Server MVP)" wrote in message
news:r6ipd4dlh1j5m60sam8gvu13v8oj3o2uqq@4ax.com...
> Full-text is built around words, and language conventions. I do not
> think it will help.
>
> Obviously the LIKE '%6332%' query does a scan now. Does it scan the
> entire table, or does it scan an index on DisplayedItemID? Is there a
> (non-clustered) index on DisplayedItemID?
>
> Roy Harvey
> Beacon Falls, CT
>
> On Fri, 26 Sep 2008 11:05:13 +0100, "ink" wrote:
>
>>
>>Hi All,
>>
>>I am not quite sure if this is the correct place for this. Sorry if it is
>>not.
>>
>>I am using SQL Server 2005.
>>
>>I have an items table that has 14 Million records in it and I have a
>>requirement to allow the user to do a search of any part of a barcode data
>>column, in this case it is called DisplayedItemID .
>>
>>In most cases the user will know at least 4 char.
>>
>>So when I run the following query:
>>Select InternalConsignmentID, DisplayedItemID From ITEMS WHERE
>>DisplayedItemID LIKE '%6332%'
>>
>>It takes about 50 sec and returns 12700 rows.
>>This is a simple example of what I am trying to achieve.
>>
>>What I am trying to do is compare this search to that of a full text index
>>search. So I have create a full text index on the Items table and I
>>started
>>the population of the index last night and it is still going like 16 hours
>>later. And when I run the following query I get no results at all.
>>Select InternalConsignmentID, DisplayedItemID From ITEMS WHERE
>>CONTAINS(DisplayedItemID, N'*6332*')
>>This to me means that it is still populating and it has not even reached
>>the
>>length of 4 char string yet. It is defiantly still populating the index.
>>
>>Here are a few of the example barcodes that are being held as data.
>>62459900/01
>>62457073
>>34597695
>>08633218
>>3563203959010
>>L356326036401H
>>3563321163016
>>1633214611015
>>
>>Is full text indexing the right thing for this kind of data?
>>Is it really going to help?
>>Is there a way of speeding it up?
>>Is this kind of data not what Full text indexing was meant for?
>>
>>Should I rather just increase the time out of the connection and the use
>>LIKE '%6332%'
>>
>>Thanks
>>ink
>>
>>
>>


Reply With Quote
  #5  
Old 09-26-2008, 08:57 AM
Database Bot
 
Join Date: Sep 2009
Posts: 1,236,254
Database Administrator is on a distinguished road
Default Re: Full text indexing recommendations

On Fri, 26 Sep 2008 06:34:40 -0500, "Dan Guzman"
wrote:

>> Is full text indexing the right thing for this kind of data?

>
>I don't think a full text index is the right tool for the job. I think a
>better approach is create individual computed columns (perhaps persisted)
>with a SUBSTRING of DisplayedItemID. You can then index those columns
>individually for maximum performance and will get more accurate results
>since only the desired portion(s) of the bar code are searched.


I was thinking of creating another table with two columns. The first
would be UPC_segment, the other would be the PK of the table being
queried. For each UPC value there would be several rows; one with the
whole UPC, a second starting with the second character, the third
starting with the third character,etc. on down to whatever string is
considered the reasonable minimum - 3 or 4 by my guess. The table
would have a clustered index on both columns with the UPC segment
coming first. The LIKE search would be LIKE '6332%' with only a
trailing %, and of course it would have to be joined back to the
original table.

The longest UPC in the sample data shows was 14 characters, which
would mean at least 11 rows in the table for that one. The shortest
was 8 characters, so five or six rows for that, depending on whether
the shortest used is the last 3 or 4 characters.

Roy Harvey
Beacon Falls, CT
Reply With Quote
  #6  
Old 09-26-2008, 09:00 AM
Database Bot
 
Join Date: Sep 2009
Posts: 1,236,254
Database Administrator is on a distinguished road
Default Re: Full text indexing recommendations

> Yes there is an index on DisplayedItemID. However indexes are always
> ignored when you have a wild card as the first Char in the surch string.
> And yes it is doing a full table scan.


I understand the implications of the leading wildcard search. I an
suggesting that you break the DisplayedItemID bar code value into its atomic
values rather than requiring the LIKE against the entire value. For
example,

ProductCode AS SUBSTRING(DisplayedItemID, 1, 6)
SubItem AS SUBSTRING(DisplayedItemID, 7, 3)

Users can then query the atomic values directly and you can persist/index as
desired. This will avoid the table scan and performance will be much
better, not to mention the benefits of a normalized database design.

--
Hope this helps.

Dan Guzman
SQL Server MVP
http://weblogs.sqlteam.com/dang/

"ink" wrote in message
news:OIJjSU9HJHA.1160@TK2MSFTNGP05.phx.gbl...
> Hi Roy,
>
> Thanks for getting back,
>
> Yes there is an index on DisplayedItemID. However indexes are always
> ignored when you have a wild card as the first Char in the surch string.
> And yes it is doing a full table scan.
>
> Thanks,
> ink
>
>
>
>
> "Roy Harvey (SQL Server MVP)" wrote in message
> news:r6ipd4dlh1j5m60sam8gvu13v8oj3o2uqq@4ax.com...
>> Full-text is built around words, and language conventions. I do not
>> think it will help.
>>
>> Obviously the LIKE '%6332%' query does a scan now. Does it scan the
>> entire table, or does it scan an index on DisplayedItemID? Is there a
>> (non-clustered) index on DisplayedItemID?
>>
>> Roy Harvey
>> Beacon Falls, CT
>>
>> On Fri, 26 Sep 2008 11:05:13 +0100, "ink" wrote:
>>
>>>
>>>Hi All,
>>>
>>>I am not quite sure if this is the correct place for this. Sorry if it is
>>>not.
>>>
>>>I am using SQL Server 2005.
>>>
>>>I have an items table that has 14 Million records in it and I have a
>>>requirement to allow the user to do a search of any part of a barcode
>>>data
>>>column, in this case it is called DisplayedItemID .
>>>
>>>In most cases the user will know at least 4 char.
>>>
>>>So when I run the following query:
>>>Select InternalConsignmentID, DisplayedItemID From ITEMS WHERE
>>>DisplayedItemID LIKE '%6332%'
>>>
>>>It takes about 50 sec and returns 12700 rows.
>>>This is a simple example of what I am trying to achieve.
>>>
>>>What I am trying to do is compare this search to that of a full text
>>>index
>>>search. So I have create a full text index on the Items table and I
>>>started
>>>the population of the index last night and it is still going like 16
>>>hours
>>>later. And when I run the following query I get no results at all.
>>>Select InternalConsignmentID, DisplayedItemID From ITEMS WHERE
>>>CONTAINS(DisplayedItemID, N'*6332*')
>>>This to me means that it is still populating and it has not even reached
>>>the
>>>length of 4 char string yet. It is defiantly still populating the index.
>>>
>>>Here are a few of the example barcodes that are being held as data.
>>>62459900/01
>>>62457073
>>>34597695
>>>08633218
>>>3563203959010
>>>L356326036401H
>>>3563321163016
>>>1633214611015
>>>
>>>Is full text indexing the right thing for this kind of data?
>>>Is it really going to help?
>>>Is there a way of speeding it up?
>>>Is this kind of data not what Full text indexing was meant for?
>>>
>>>Should I rather just increase the time out of the connection and the use
>>>LIKE '%6332%'
>>>
>>>Thanks
>>>ink
>>>
>>>
>>>

>


Reply With Quote
  #7  
Old 09-26-2008, 09:02 AM
Database Bot
 
Join Date: Sep 2009
Posts: 1,236,254
Database Administrator is on a distinguished road
Default Re: Full text indexing recommendations

Hi Dan,

Thanks for getting back.

As the data is all different lengths and i cant be sure what part they are
searching for. I am not sure how i would implement this.

I have done a simple test using the following queries.

Took: 41 sec
Select InternalConsignmentID, DisplayedItemID From ITEMS WHERE
SUBSTRING(DisplayedItemID,2,4) ='6332'

Took: 51 Sec
Select InternalConsignmentID, DisplayedItemID From ITEMS WHERE
DisplayedItemID LIKE '__6332%'

Took: 44 Sec
Select InternalConsignmentID, DisplayedItemID From ITEMS WHERE
DisplayedItemID LIKE '%6332%'


I need to try and get it down to under 20 Sec and as you can see there
really wasnt much difference using SUBSTRING. If i was to create and indexed
view of the data then i know it would be much faster but i have also lost
the capability of searching for any part of the string.

is the above sort of what you were thinking about?

thanks,
ink




"Dan Guzman" wrote in message
news:OadGev8HJHA.4564@TK2MSFTNGP02.phx.gbl...
>> Is full text indexing the right thing for this kind of data?

>
> I don't think a full text index is the right tool for the job. I think a
> better approach is create individual computed columns (perhaps persisted)
> with a SUBSTRING of DisplayedItemID. You can then index those columns
> individually for maximum performance and will get more accurate results
> since only the desired portion(s) of the bar code are searched.
>
> --
> Hope this helps.
>
> Dan Guzman
> SQL Server MVP
> http://weblogs.sqlteam.com/dang/
>
> "ink" wrote in message
> news:Otmsh97HJHA.3736@TK2MSFTNGP06.phx.gbl...
>>
>> Hi All,
>>
>> I am not quite sure if this is the correct place for this. Sorry if it is
>> not.
>>
>> I am using SQL Server 2005.
>>
>> I have an items table that has 14 Million records in it and I have a
>> requirement to allow the user to do a search of any part of a barcode
>> data column, in this case it is called DisplayedItemID .
>>
>> In most cases the user will know at least 4 char.
>>
>> So when I run the following query:
>> Select InternalConsignmentID, DisplayedItemID From ITEMS WHERE
>> DisplayedItemID LIKE '%6332%'
>>
>> It takes about 50 sec and returns 12700 rows.
>> This is a simple example of what I am trying to achieve.
>>
>> What I am trying to do is compare this search to that of a full text
>> index search. So I have create a full text index on the Items table and I
>> started the population of the index last night and it is still going like
>> 16 hours later. And when I run the following query I get no results at
>> all.
>> Select InternalConsignmentID, DisplayedItemID From ITEMS WHERE
>> CONTAINS(DisplayedItemID, N'*6332*')
>> This to me means that it is still populating and it has not even reached
>> the length of 4 char string yet. It is defiantly still populating the
>> index.
>>
>> Here are a few of the example barcodes that are being held as data.
>> 62459900/01
>> 62457073
>> 34597695
>> 08633218
>> 3563203959010
>> L356326036401H
>> 3563321163016
>> 1633214611015
>>
>> Is full text indexing the right thing for this kind of data?
>> Is it really going to help?
>> Is there a way of speeding it up?
>> Is this kind of data not what Full text indexing was meant for?
>>
>> Should I rather just increase the time out of the connection and the use
>> LIKE '%6332%'
>>
>> Thanks
>> ink
>>
>>
>>
>>

>


Reply With Quote
  #8  
Old 09-26-2008, 09:02 AM
Database Bot
 
Join Date: Sep 2009
Posts: 1,236,254
Database Administrator is on a distinguished road
Default Re: Full text indexing recommendations

On Fri, 26 Sep 2008 13:40:31 +0100, "ink" wrote:

>Yes there is an index on DisplayedItemID. However indexes are always ignored
>when you have a wild card as the first Char in the surch string. And yes it
>is doing a full table scan.


Sometimes it is possible to get an index scan, even with a leading %
when the optimizer thinks the reduced I/O of scanning the smaller
index will be worth it. A table with wide rows would be more likely
to go that way. But I am not surprised you do not in this case.

Roy Harvey
Beacon Falls, CT
Reply With Quote
  #9  
Old 09-26-2008, 09:06 AM
Database Bot
 
Join Date: Sep 2009
Posts: 1,236,254
Database Administrator is on a distinguished road
Default Re: Full text indexing recommendations

ink wrote on Fri, 26 Sep 2008 11:05:13 +0100:


> I have an items table that has 14 Million records in it and I have a
> requirement to allow the user to do a search of any part of a barcode
> data column, in this case it is called DisplayedItemID .


> In most cases the user will know at least 4 char.


> So when I run the following query:
> Select InternalConsignmentID, DisplayedItemID From ITEMS WHERE
> DisplayedItemID LIKE '%6332%'


> It takes about 50 sec and returns 12700 rows.
> This is a simple example of what I am trying to achieve.


> What I am trying to do is compare this search to that of a full text
> index search. So I have create a full text index on the Items table
> and I started the population of the index last night and it is still
> going like 16 hours later. And when I run the following query I get no
> results at all.
> Select InternalConsignmentID, DisplayedItemID From ITEMS WHERE
> CONTAINS(DisplayedItemID, N'*6332*')


The initial catalog build can take a long time with very large tables as the
FTS system is outside of SQL Server.

This won't work. You can only use the * as a suffix, eg.

6332* will find all indexed "words" starting with the string 6332

You cannot use a * at the start of a "word".

> This to me means that it is still populating and it has not even
> reached the length of 4 char string yet. It is defiantly still
> populating the index.


> Here are a few of the example barcodes that are being held as data.
> 62459900/01 62457073 34597695 08633218 3563203959010
> L356326036401H 3563321163016 1633214611015


The first will likely be split into the words 62459900 and 01 (although the
noise word file might discard the 01 if it treats it as just 1).

> Is full text indexing the right thing for this kind of data?


Definitely not.

> Is it really going to help?


No.

> Is this kind of data not what Full text indexing was meant for?


For performing searches on whole words. For instance, I have a number of web
sites with almost 1 million books listed on them. FTS enables my search to
quickly find matches for words entered into searches, such as "windows",
which would potentially take a lot longer doing scans with LIKE '% windows
%' (assuming that words have spaces around them, but the FTS indexer will
also break on other characters, depending on the language settings, such as
punctuation - doing this with LIKE would sometimes mean looking for many
different combinations of the word being searched and surrounding
punctuation/whitespace). The FTS search can very quickly find the "windows"
word entry in it's index and pass back the primary key values for the
matching rows to SQL Server to then pull up the actual row data from the
table.

--
Dan


Reply With Quote
  #10  
Old 09-26-2008, 09:08 AM
Database Bot
 
Join Date: Sep 2009
Posts: 1,236,254
Database Administrator is on a distinguished road
Default Re: Full text indexing recommendations


I see what you mean. that does however mean that it could be a table with a
max row count of 12 million * the max length of the DisplayID. I am not so
sure my clients or my manager will go for that.





"Roy Harvey (SQL Server MVP)" wrote in message
news:3hmpd4dlgpg6j3k9bu8gd96bv0hlkdvec4@4ax.com...
> On Fri, 26 Sep 2008 06:34:40 -0500, "Dan Guzman"
> wrote:
>
>>> Is full text indexing the right thing for this kind of data?

>>
>>I don't think a full text index is the right tool for the job. I think a
>>better approach is create individual computed columns (perhaps persisted)
>>with a SUBSTRING of DisplayedItemID. You can then index those columns
>>individually for maximum performance and will get more accurate results
>>since only the desired portion(s) of the bar code are searched.

>
> I was thinking of creating another table with two columns. The first
> would be UPC_segment, the other would be the PK of the table being
> queried. For each UPC value there would be several rows; one with the
> whole UPC, a second starting with the second character, the third
> starting with the third character,etc. on down to whatever string is
> considered the reasonable minimum - 3 or 4 by my guess. The table
> would have a clustered index on both columns with the UPC segment
> coming first. The LIKE search would be LIKE '6332%' with only a
> trailing %, and of course it would have to be joined back to the
> original table.
>
> The longest UPC in the sample data shows was 14 characters, which
> would mean at least 11 rows in the table for that one. The shortest
> was 8 characters, so five or six rows for that, depending on whether
> the shortest used is the last 3 or 4 characters.
>
> Roy Harvey
> Beacon Falls, CT


Reply With Quote
Reply

Thread Tools
Display Modes



All times are GMT -4. The time now is 04:20 PM.