CRC calculation per record possible? - filemaker
This is a discussion on CRC calculation per record possible? - filemaker ; Dear Listeners: I have a database with banking records (date, amount, paid/received, bank account from, bank account to, 2 fields with comments, currency code), with a total length of 263 bytes each, distributed over 9 fields with variable length (sometimes ...
![]() |
| | LinkBack | Thread Tools | Display Modes |
|
#1
| |||
| |||
| I have a database with banking records (date, amount, paid/received, bank account from, bank account to, 2 fields with comments, currency code), with a total length of 263 bytes each, distributed over 9 fields with variable length (sometimes even empty). I need to weed out any possible 100% duplicates that may have been/will be imported by "mishap". The non-duplicate information can unfortunately be positioned anywhere along the stretch of those 263 bytes. E.g. byte 211 may be the only one being different between e.g. 2 or 3 otherwise identical records. Therefore the usual duplicate search with "!" is not working due to the great length of the string, even not in FMP 9 with its 100 bytes "discrimination" capability. I now thought of using a CRC/Cyclic Redundancy Check *per record* to find those 100% duplicates. Is that a) a good idea and b) feasible within an FMP calc field, and if so, can someone point me into the right direction how to implement this? I am also thinking of using some existing outside tool - a command line (Windows) to which FMP can write and from which it could read. Does anybody have any good advice on that? This is preferably FMP 6 but FMP 9 would be possible, too. The time a CRC calculation of all records would take is not critical; the number of records would be in the order of magnitude of several 10,000's. Your suggestions are very welcome. -- Met vriendelijke groet / Mit freundlichen Gruessen / With kind regards Christoph Bouthillier p o s t <> oh-no-spam t e k s t o t a a l << d o t >> c o m Forget the oh-no-spam |
|
#2
| |||
| |||
| "Christoph Bouthillier" news:4b0eb$49137465$c3f0d641$8647@news.speedlinq.n l... > Dear Listeners: > > I have a database with banking records (date, amount, paid/received, bank > account from, bank account to, 2 fields with comments, currency code), with > a total length of 263 bytes each, distributed over 9 fields with variable > length (sometimes even empty). > > I need to weed out any possible 100% duplicates that may have been/will be > imported by "mishap". The non-duplicate information can unfortunately be > positioned anywhere along the stretch of those 263 bytes. E.g. byte 211 may > be the only one being different between e.g. 2 or 3 otherwise identical > records. Therefore the usual duplicate search with "!" is not working due to > the great length of the string, even not in FMP 9 with its 100 bytes > "discrimination" capability. > > I now thought of using a CRC/Cyclic Redundancy Check *per record* to find > those 100% duplicates. Is that > a) a good idea > and > b) feasible within an FMP calc field, and if so, can someone point me into > the right direction how to implement this? > > I am also thinking of using some existing outside tool - a command line > (Windows) to which FMP can write and from which it could read. Does anybody > have any good advice on that? > > This is preferably FMP 6 but FMP 9 would be possible, too. > > The time a CRC calculation of all records would take is not critical; the > number of records would be in the order of magnitude of several 10,000's. > > Your suggestions are very welcome. If all the data in the various fields are each less than FileMaker's length limit, then it would be simpler to just create a Script that performs a Find with the ! duplicate symbol in every field. ie. Enter Find Mode [] Set Field [Date, "!"] Set Field [Amount, "!"] Set Field [BankFrom, "!"] etc. Perform Find [] Once you've set it up the Script can be easily run whenever you need to. Helpful Harry Still in Windows "wonder"land |
|
#3
| |||
| |||
|
(Please see below) -- "Your Name" news:gevvud$ebl$1@lust.ihug.co.nz... > > "Christoph Bouthillier" > news:4b0eb$49137465$c3f0d641$8647@news.speedlinq.n l... >> Dear Listeners: >> >> I have a database with banking records (date, amount, paid/received, bank >> account from, bank account to, 2 fields with comments, currency code), > with >> a total length of 263 bytes each, distributed over 9 fields with variable >> length (sometimes even empty). >> >> I need to weed out any possible 100% duplicates that may have been/will >> be >> imported by "mishap". The non-duplicate information can unfortunately be >> positioned anywhere along the stretch of those 263 bytes. E.g. byte 211 > may >> be the only one being different between e.g. 2 or 3 otherwise identical >> records. Therefore the usual duplicate search with "!" is not working due > to >> the great length of the string, even not in FMP 9 with its 100 bytes >> "discrimination" capability. >> >> I now thought of using a CRC/Cyclic Redundancy Check *per record* to find >> those 100% duplicates. Is that >> a) a good idea >> and >> b) feasible within an FMP calc field, and if so, can someone point me >> into >> the right direction how to implement this? >> >> I am also thinking of using some existing outside tool - a command line >> (Windows) to which FMP can write and from which it could read. Does > anybody >> have any good advice on that? >> >> This is preferably FMP 6 but FMP 9 would be possible, too. >> >> The time a CRC calculation of all records would take is not critical; the >> number of records would be in the order of magnitude of several 10,000's. >> >> Your suggestions are very welcome. > > If all the data in the various fields are each less than FileMaker's > length > limit, then it would be simpler to just create a Script that performs a > Find > with the ! duplicate symbol in every field. > ie. > Enter Find Mode [] > Set Field [Date, "!"] > Set Field [Amount, "!"] > Set Field [BankFrom, "!"] > etc. > Perform Find [] > > Once you've set it up the Script can be easily run whenever you need to. > > Helpful Harry > Still in Windows "wonder"land > > Hello Harry, Thanks for your suggestion but unfortunately one of the fields is 159 bytes long ;=( I found on the net some C(+(+)) code for the calculation of CRCs (which would then be short and unique per record) but since I have no clue if/how it is possible to implement such code in an FMP calc field I was/am still hoping someone can help me out. Met vriendelijke groet / Mit freundlichen Gruessen / With kind regards Christoph Bouthillier p o s t <> oh-no-spam t e k s t o t a a l << d o t >> c o m Forget the oh-no-spam |
|
#4
| |||
| |||
| "Christoph Bouthillier" news:a7cac$4913877f$c3f0d641$2099@news.speedlinq.n l... > "Your Name" > news:gevvud$ebl$1@lust.ihug.co.nz... > > "Christoph Bouthillier" > > news:4b0eb$49137465$c3f0d641$8647@news.speedlinq.n l... > >> > >> Dear Listeners: > >> > >> I have a database with banking records (date, amount, paid/received, bank > >> account from, bank account to, 2 fields with comments, currency code), > >> with a total length of 263 bytes each, distributed over 9 fields with variable > >> length (sometimes even empty). > >> > >> I need to weed out any possible 100% duplicates that may have been/will > >> be imported by "mishap". The non-duplicate information can unfortunately > >> be positioned anywhere along the stretch of those 263 bytes. E.g. byte > >> 211 may be the only one being different between e.g. 2 or 3 otherwise > >> identical records. Therefore the usual duplicate search with "!" is not > >> working due to the great length of the string, even not in FMP 9 with > >> its 100 bytes "discrimination" capability. > >> > > > > If all the data in the various fields are each less than FileMaker's > > length limit, then it would be simpler to just create a Script that > > performs a Find with the ! duplicate symbol in every field. > > ie. > > Enter Find Mode [] > > Set Field [Date, "!"] > > Set Field [Amount, "!"] > > Set Field [BankFrom, "!"] > > etc. > > Perform Find [] > > > > Once you've set it up the Script can be easily run whenever you > > need to. > > Hello Harry, > > Thanks for your suggestion but unfortunately one of the fields is 159 bytes > long ;=( The easiest way around this would be to split it into two, ie. ShortField1 = Left (LongField, 75) ShortField2 = Substitute(LongField, ShortField1, "") and then just use the ! duplicate symbol in these two Fields instead of the longer one. These two Fields do not have to be on any Layout for the Script to still work. > I found on the net some C(+(+)) code for the calculation of CRCs (which > would then be short and unique per record) but since I have no clue if/how > it is possible to implement such code in an FMP calc field I was/am still > hoping someone can help me out. It depends on how the C code does it, but it may be possible, although it would probably be quite a bit longer. Helpful Harry Still in Windows "wonder"land |
|
#5
| |||
| |||
|
Christoph, A CRC (or hash) is a very complex maths solution. You wouldn't be able to reproduce that. But with some work you could reproduce the Adler control. Which is more basic, but would do the trick for you. First get the ASC customfunction from http://www.briandunning.com/cf/283 (Hope you have fmp advanced, because you need it too implement) Create a second customfunction MakeAscii ( text ; textlength ) containing: Case ( textlength > 0 ; MakeAscii ( text ; textlength -1 ) & ASC ( Middle ( text ; textlength; 1) ) & "¶"; "0" ) Fields: text (Holds any text) TextAscii < calculation ; text ; MakeAscii ( text ; length ( text ) ) > RESULT now create the following script: createSUM with the following steps. Loop Set Variable [ $Source; Value:textAscii ] Loop Set Variable [ $A; Value:If ( IsEmpty ($A ) ; LeftValues ( $Source ; 1) + 1 ; $A + LeftValues ( $Source ; 1) ) ] Set Variable [ $B; Value:If ( IsEmpty ( $B) ; $A ; $B + $A ) ] Set Variable [ $Source; Value:RightValues ( $Source ; ValueCount ( $Source ) -1 ) ] Exit Loop If [ IsEmpty ( $Source) ] End Loop Set Field [ test::result; ($B * 65536) + $A ] goto record (next/exit after last) end loop Done Keep well, Ursus "Christoph Bouthillier" news:4b0eb$49137465$c3f0d641$8647@news.speedlinq.n l... > Dear Listeners: > > I have a database with banking records (date, amount, paid/received, bank > account from, bank account to, 2 fields with comments, currency code), > with a total length of 263 bytes each, distributed over 9 fields with > variable length (sometimes even empty). > > I need to weed out any possible 100% duplicates that may have been/will be > imported by "mishap". The non-duplicate information can unfortunately be > positioned anywhere along the stretch of those 263 bytes. E.g. byte 211 > may be the only one being different between e.g. 2 or 3 otherwise > identical records. Therefore the usual duplicate search with "!" is not > working due to the great length of the string, even not in FMP 9 with its > 100 bytes "discrimination" capability. > > I now thought of using a CRC/Cyclic Redundancy Check *per record* to find > those 100% duplicates. Is that > a) a good idea > and > b) feasible within an FMP calc field, and if so, can someone point me into > the right direction how to implement this? > > I am also thinking of using some existing outside tool - a command line > (Windows) to which FMP can write and from which it could read. Does > anybody have any good advice on that? > > This is preferably FMP 6 but FMP 9 would be possible, too. > > The time a CRC calculation of all records would take is not critical; the > number of records would be in the order of magnitude of several 10,000's. > > Your suggestions are very welcome. > > -- > Met vriendelijke groet / Mit freundlichen Gruessen / With kind regards > Christoph Bouthillier > p o s t <> oh-no-spam t e k s t o t a a l << d o t >> c o m > Forget the oh-no-spam > |
|
#6
| |||
| |||
|
If you plan using larger texts you might substute the SET FIELD for: (Mod ( $B ; 65521 ) * 65536) + Mod ( $a ; 65521 ) 65521 being the largest prime under 2^16 keeps it as realiable, and more readable "Ursus" news:83ee8$491495f5$5ed017f3$21897@cache6.tilbu1.n b.home.nl... > Christoph, > > A CRC (or hash) is a very complex maths solution. You wouldn't be able to > reproduce that. But with some work you could reproduce the Adler control. > Which is more basic, but would do the trick for you. > > First get the ASC customfunction from http://www.briandunning.com/cf/283 > (Hope you have fmp advanced, because you need it too implement) > Create a second customfunction MakeAscii ( text ; textlength ) containing: > Case ( textlength > 0 ; MakeAscii ( text ; textlength -1 ) & ASC ( Middle > ( text ; textlength; 1) ) & "¶"; "0" ) > > Fields: text (Holds any text) > TextAscii < calculation ; text ; MakeAscii ( text ; length ( text ) ) > > RESULT > > now create the following script: createSUM with the following steps. > Loop > Set Variable [ $Source; Value:textAscii ] > Loop > Set Variable [ $A; Value:If ( IsEmpty ($A ) ; LeftValues ( $Source ; 1) + > 1 ; $A + LeftValues ( $Source ; 1) ) ] > Set Variable [ $B; Value:If ( IsEmpty ( $B) ; $A ; $B + $A ) ] > Set Variable [ $Source; Value:RightValues ( $Source ; ValueCount ( > $Source ) -1 ) ] > Exit Loop If [ IsEmpty ( $Source) ] > End Loop > Set Field [ test::result; ($B * 65536) + $A ] > goto record (next/exit after last) > end loop > > Done > > Keep well, Ursus > > "Christoph Bouthillier" > news:4b0eb$49137465$c3f0d641$8647@news.speedlinq.n l... >> Dear Listeners: >> >> I have a database with banking records (date, amount, paid/received, bank >> account from, bank account to, 2 fields with comments, currency code), >> with a total length of 263 bytes each, distributed over 9 fields with >> variable length (sometimes even empty). >> >> I need to weed out any possible 100% duplicates that may have been/will >> be imported by "mishap". The non-duplicate information can unfortunately >> be positioned anywhere along the stretch of those 263 bytes. E.g. byte >> 211 may be the only one being different between e.g. 2 or 3 otherwise >> identical records. Therefore the usual duplicate search with "!" is not >> working due to the great length of the string, even not in FMP 9 with its >> 100 bytes "discrimination" capability. >> >> I now thought of using a CRC/Cyclic Redundancy Check *per record* to find >> those 100% duplicates. Is that >> a) a good idea >> and >> b) feasible within an FMP calc field, and if so, can someone point me >> into the right direction how to implement this? >> >> I am also thinking of using some existing outside tool - a command line >> (Windows) to which FMP can write and from which it could read. Does >> anybody have any good advice on that? >> >> This is preferably FMP 6 but FMP 9 would be possible, too. >> >> The time a CRC calculation of all records would take is not critical; the >> number of records would be in the order of magnitude of several 10,000's. >> >> Your suggestions are very welcome. >> >> -- >> Met vriendelijke groet / Mit freundlichen Gruessen / With kind regards >> Christoph Bouthillier >> p o s t <> oh-no-spam t e k s t o t a a l << d o t >> c o m >> Forget the oh-no-spam >> > > |
|
#8
| |||
| |||
|
Seriously.. there is a much more elegant and simple solution.. it's called an MD5 Hash. A "hash" is a one way calculation, based on a certain set of data.. one-way, because you cannot reconstruct the original data from the hash result. But, if two data blocks generate the same hash value, there is an extremely high probability that the data blocks contain the same data. There's a free plugin from 360works that looks very interesting, and it includes the MD5 function. I'm going to check it out too.. http://www.360works.com/scriptmaster/ MD5 information http://en.wikipedia.org/wiki/MD5 Here are some samples, from Wikipedia The 128-bit (16-byte) MD5 hashes (also termed message digests) are typically represented as a sequence of 32 hexadecimal digits. The following demonstrates a 43-byte ASCII input and the corresponding MD5 hash: MD5("The quick brown fox jumps over the lazy dog") = 9e107d9d372bb6826bd81d3542a419d6 Even a small change in the message will (with overwhelming probability) result in a completely different hash, due to the avalanche effect. For example, adding a period to the end of the sentence: MD5("The quick brown fox jumps over the lazy dog.") = e4d909c290d0fb1ca068ffaddf22cbd0 The hash of the zero-length string is: MD5("") = d41d8cd98f00b204e9800998ecf8427e NOTE: I understand that FileMaker has some limitations on how many significant characters in an text variable can be indexed, so you may need to resort to storing the hash value in two or more fields.. Also, I would construct my data block as such field1 field2 field3 .... This should help segregate each field and make the length of each field less relevant. Just as an additional step. Hope this helps.. let us know how it works out.. |
|
#9
| |||
| |||
|
The so called FREE version is a demo version that runs for only two hours after each FMP start. Which makes use pretty limited, but might be just enough for your needs. Ursus "m. alam" news:3c105abc-44ce-4c3a-a487-64f5b5946d0b@c36g2000prc.googlegroups.com... > Seriously.. there is a much more elegant and simple solution.. it's > called an MD5 Hash. > A "hash" is a one way calculation, based on a certain set of data.. > one-way, because you > cannot reconstruct the original data from the hash result. But, if two > data blocks generate > the same hash value, there is an extremely high probability that the > data blocks contain > the same data. > > There's a free plugin from 360works that looks very interesting, and > it includes the MD5 > function. I'm going to check it out too.. > > http://www.360works.com/scriptmaster/ > > MD5 information > http://en.wikipedia.org/wiki/MD5 > > Here are some samples, from Wikipedia > > The 128-bit (16-byte) MD5 hashes (also termed message digests) are > typically represented as a sequence of 32 hexadecimal digits. The > following demonstrates a 43-byte ASCII input and the corresponding MD5 > hash: > > MD5("The quick brown fox jumps over the lazy dog") > = 9e107d9d372bb6826bd81d3542a419d6 > > Even a small change in the message will (with overwhelming > probability) result in a completely different hash, due to the > avalanche effect. > For example, adding a period to the end of the sentence: > > MD5("The quick brown fox jumps over the lazy dog.") > = e4d909c290d0fb1ca068ffaddf22cbd0 > > The hash of the zero-length string is: > > MD5("") > = d41d8cd98f00b204e9800998ecf8427e > > > NOTE: I understand that FileMaker has some limitations on how many > significant characters in an text variable can be indexed, so you may > need to resort to storing the hash value in two or more fields.. > > Also, I would construct my data block as such > field1 > field2 > field3 > ... > > This should help segregate each field and make the length of each > field less relevant. Just as an additional step. > > Hope this helps.. let us know how it works out.. |
|
#10
| |||
| |||
|
Me again, Just found a free MD5 solution on: http://www.unwwwired.net/projects/md5 I tinkered a bit with it, but found some problems with large text fields, but for smaller fields it works like a charm Ursus "Ursus" news:239e8$491805bb$5ed017f3$5880@cache6.tilbu1.nb .home.nl... > The so called FREE version is a demo version that runs for only two hours > after each FMP start. Which makes use pretty limited, but might be just > enough for your needs. > > Ursus > > "m. alam" > news:3c105abc-44ce-4c3a-a487-64f5b5946d0b@c36g2000prc.googlegroups.com... >> Seriously.. there is a much more elegant and simple solution.. it's >> called an MD5 Hash. >> A "hash" is a one way calculation, based on a certain set of data.. >> one-way, because you >> cannot reconstruct the original data from the hash result. But, if two >> data blocks generate >> the same hash value, there is an extremely high probability that the >> data blocks contain >> the same data. >> >> There's a free plugin from 360works that looks very interesting, and >> it includes the MD5 >> function. I'm going to check it out too.. >> >> http://www.360works.com/scriptmaster/ >> >> MD5 information >> http://en.wikipedia.org/wiki/MD5 >> >> Here are some samples, from Wikipedia >> >> The 128-bit (16-byte) MD5 hashes (also termed message digests) are >> typically represented as a sequence of 32 hexadecimal digits. The >> following demonstrates a 43-byte ASCII input and the corresponding MD5 >> hash: >> >> MD5("The quick brown fox jumps over the lazy dog") >> = 9e107d9d372bb6826bd81d3542a419d6 >> >> Even a small change in the message will (with overwhelming >> probability) result in a completely different hash, due to the >> avalanche effect. >> For example, adding a period to the end of the sentence: >> >> MD5("The quick brown fox jumps over the lazy dog.") >> = e4d909c290d0fb1ca068ffaddf22cbd0 >> >> The hash of the zero-length string is: >> >> MD5("") >> = d41d8cd98f00b204e9800998ecf8427e >> >> >> NOTE: I understand that FileMaker has some limitations on how many >> significant characters in an text variable can be indexed, so you may >> need to resort to storing the hash value in two or more fields.. >> >> Also, I would construct my data block as such >> field1 >> field2 >> field3 >> ... >> >> This should help segregate each field and make the length of each >> field less relevant. Just as an additional step. >> >> Hope this helps.. let us know how it works out.. > > |
![]() |
« Previous Thread
|
Next Thread »
| Thread Tools | |
| Display Modes | |
| |
All times are GMT -4. The time now is 09:48 AM.




Linear Mode