| Comparing two data files and writing to a diff file (Request help) | |
|
Author | Message |
---|
Servant Member
Posts : 4 Join date : 2011-05-20
| Subject: Comparing two data files and writing to a diff file (Request help) Thu Jun 23, 2011 7:35 am | |
| Hi Everyone, I have two files i.e. one file2 contains today's data and the other file1 contains Yesterday's data. The data in the files contains 226 columns and the data for the coulums separated by a Pipe "|" delimiter. Now, I have 4 Primary keys (coulumns) by which I have to compare file2 and file1 and generate a diff file that is the data that is not present in file2 but present in file1 and vice versa. The sample of the data that contains in file1 is ( 1 record): - Code:
-
START-OF-FILE FILENAME=fixedincome_bo_namr.dif DATA=bo REGION=namr TYPE=dif PROGRAMNAME=getdata DATEFORMAT=yyyymmdd
START-OF-FIELDS # Security description TICKER CPN MATURITY SERIES NAME SHORT_NAME ISSUER_INDUSTRY MARKET_SECTOR_DES CPN_FREQ CPN_TYP MTY_TYP CALC_TYP_DES DAY_CNT MARKET_ISSUE COUNTRY CRNCY COLLAT_TYP AMT_ISSUED AMT_OUTSTANDING MIN_PIECE MIN_INCREMENT PAR_AMT LEAD_MGR EXCH_CODE REDEMP_VAL
# Issuance information ANNOUNCE_DT FIRST_SETTLE_DT FIRST_CPN_DT INT_ACC_DT ISSUE_DT ISSUE_PX
# Identifiers ID_EUROCLEAR ID_XTRAKTER ID_SEDOL1 ID_SEDOL2 ID_CEDEL ID_WERTPAPIER ID_ISIN ID_DUTCH ID_VALOREN ID_FRENCH ID_COMMON ID_JAPAN ID_BELGIUM ID_DANISH ID_AUSTRIAN ID_LUXEMBOURG ID_SWEDISH ID_NORWAY ID_JAPAN_COMPANY ID_SPAIN ID_ITALY ID_BB_COMPANY ID_BB_SECURITY ID_CUSIP
# Schedules NXT_CALL_DT NXT_CALL_PX NXT_PAR_CALL_DT NXT_PUT_DT NXT_PUT_PX NXT_PAR_PUT_DT NXT_CPN_DT NXT_SINK_DT NXT_REFUND_DT
# Ratings RTG_JCR
# Floaters REFIX_FREQ NXT_REFIX_DT RESET_IDX
# Inflation-indexed bonds
# Preferreds PFD_DVD_PAY_DT PFD_EX_DVD_DT PFD_RST_DVD
# Convertibles CV_COMMON_TICKER CV_COMMON_TICKER_EXCH CV_CNVS_RATIO CV_UNTIL CV_CNVS_FEXCH_RT CV_PROV_PX
CALLABLE SINKABLE PUTABLE
ID_BB_PARENT_CO PARENT_COMP_NAME PARENT_COMP_TICKER CNTRY_OF_INCORPORATION BASIC_SPREAD INDUSTRY_SECTOR INDUSTRY_GROUP INDUSTRY_SUBGROUP COUNTRY_GUARANTOR CNTRY_OF_DOMICILE SECURITY_DES 144A_FLAG FLT_BENCH_MULTIPLIER SECURITY_TYP CV_START_DT GUARANTOR_NAME PREV_CPN_DT NXT_SINK_AMT CALL_DISCRETE PUT_DISCRETE MAKE_WHOLE_CALL ID_BB_UNIQUE LONG_COMP_NAME REDEMP_CRNCY CPN_CRNCY DTC_ELIGIBLE STRUCTURED_NOTE PCT_PAR_QUOTED PCS_QUOTE_TYP IS_UNIT_TRADED IS_REVERSE_CONVERTIBLE TRADE_CRNCY BEARER REGISTERED CALLED CALLED_DT ISSUER CALL_FEATURE PUT_FEATURE PENULTIMATE_CPN_DT FLT_CPN_CONVENTION CUR_CPN FLOATER TRADE_STATUS CDR_COUNTRY_CODE CDR_SETTLE_CODE SEASONING_STATUS FINAL_MATURITY PRVT_PLACE CALC_TYP REMOVAL_REASON IS_PERPETUAL IS_REG_S CALLED_PX DEFAULTED GILTS_EX_DVD_DT MOST_RECENT_REPORTED_FACTOR NXT_FACTOR_DT OID_BOND DELIVERY_TYP ID_SEDOL3 ID_SEDOL4 ID_SEDOL5 SEDOL1_COUNTRY_ISO SEDOL2_COUNTRY_ISO SEDOL3_COUNTRY_ISO SEDOL4_COUNTRY_ISO SEDOL5_COUNTRY_ISO ID_MIC1 ID_MIC2 ID_MIC3 ID_MIC4 ID_MIC5 CV_SH_PAR DUAL_CRNCY EXTENDIBLE EXCHANGEABLE IS_SOFT_CALL CV_MANDATORY_CNVS EU_SAVINGS_DIRECTIVE ID_CUSIP_REAL INDUSTRY_SUBGROUP_NUM SECURITY_TYP2 LAST_REFIX_DT ISO_COUNTRY_GUARANTOR DTC_REGISTERED CALL_PARTIAL CV_CNVS_PX IS_CURRENT_GOVT FIRST_CALL_DT_ISSUANCE UNDL_ID_BB_UNIQUE EST_CPN_FLAG ID_BB_GUARANTOR IS_DAY_PAYER STEPUP_CPN STEPUP_DT CALC_MATURITY ID_EXCH_SYMBOL CREDIT_ENHANCEMENTS INSURANCE_STATUS JUNIOR SENIOR FLT_PAY_DAY FLT_DAYS_PRIOR INFLATION_LINKED_INDICATOR DAYS_TO_SETTLE TYPE_OF_BOND REFERENCE_INDEX BASE_CPI CFI_CODE CPN_FREQ_YLD_CNV DAY_PAYER_FREQ EX_DIV_DAYS EX_DIV_CALENDAR CONTINGENT_CONVERSION CONTRIB_DATA_INDICATOR SECURITY_FACTORABLE ID_BB_GLOBAL END-OF-FIELDS
TIMESTARTED=Wed Jun 8 18:33:51 EDT 2011 230 START-OF-DATA 231 CP5101987 Corp|-1|198|C|7.300000|20110607|EAB|EUROPEAN AMERICAN BANK|LNB-CALL12/97|BANK|Corp|2|STEP CPN|CALLABLE|MULTI-STEP BOND|1|US DOMESTIC|US|USD|DE POSIT NOTES|760000.00|.00|10000.0000|1000.0000|1000.00|LAS-sole|NOT LISTED|100.00000|19960607|19960607|19961207|19960607|19960607|100.000000| | | | | | |US29874AZZ55| | | | | | | | | | | | | | |225433|500231|29874AZZ5| | | | | | | | | |N.A.| | | | | | | | | | | | |Y|N|N|197879|Citibank NA|8156Z|US| |Fin ancial|Banks|Money Center Banks|N.A.|US|C 7.3 06/07/11|N| |US DOMESTIC| |N.A.| | |Y| |N|COCP5101987|European American Bank|USD|USD|Y|Y|Y|1|N|N|USD|N|N|Y |19971207|EUROPEAN AMERICAN BANK|Semi-Annual| |19970607| | |N|N|US|US|Does Not Apply|20110607|N|421|MATURED|N|N|100.000000|N| |.000000000| |N|DTC| | | | N.A.|N.A.|N.A.|N.A.|N.A.| | | | | | |N|N|N|N| |Grandfathered|29874AZZ5|145| | |N.A.|N|FULL (ONLY)| |N|19970607| | | |N| | |20110607| | |N|N|N| | |N|3| | | |N.A.|2| | | | |N|N|BBG00048KQJ7|
The sample of the data that contains in file2 (Todays file) is ( 1 record): - Code:
-
START-OF-FILE FILENAME=fixedincome_bo_namr.dif DATA=bo REGION=namr TYPE=dif PROGRAMNAME=getdata DATEFORMAT=yyyymmdd
START-OF-FIELDS # Security description TICKER CPN MATURITY SERIES NAME SHORT_NAME ISSUER_INDUSTRY MARKET_SECTOR_DES CPN_FREQ CPN_TYP MTY_TYP CALC_TYP_DES DAY_CNT MARKET_ISSUE COUNTRY CRNCY COLLAT_TYP AMT_ISSUED AMT_OUTSTANDING MIN_PIECE MIN_INCREMENT PAR_AMT LEAD_MGR EXCH_CODE REDEMP_VAL
# Issuance information ANNOUNCE_DT FIRST_SETTLE_DT FIRST_CPN_DT INT_ACC_DT ISSUE_DT ISSUE_PX
# Identifiers ID_EUROCLEAR ID_XTRAKTER ID_SEDOL1 ID_SEDOL2 ID_CEDEL ID_WERTPAPIER ID_ISIN ID_DUTCH ID_VALOREN ID_FRENCH ID_COMMON ID_JAPAN ID_BELGIUM ID_DANISH ID_AUSTRIAN ID_LUXEMBOURG ID_SWEDISH ID_NORWAY ID_JAPAN_COMPANY ID_SPAIN ID_ITALY ID_BB_COMPANY ID_BB_SECURITY ID_CUSIP
# Schedules NXT_CALL_DT NXT_CALL_PX NXT_PAR_CALL_DT NXT_PUT_DT NXT_PUT_PX NXT_PAR_PUT_DT NXT_CPN_DT NXT_SINK_DT NXT_REFUND_DT
# Ratings RTG_JCR
# Floaters REFIX_FREQ NXT_REFIX_DT RESET_IDX
# Inflation-indexed bonds
# Preferreds PFD_DVD_PAY_DT PFD_EX_DVD_DT PFD_RST_DVD
# Convertibles CV_COMMON_TICKER CV_COMMON_TICKER_EXCH CV_CNVS_RATIO CV_UNTIL CV_CNVS_FEXCH_RT CV_PROV_PX
CALLABLE SINKABLE PUTABLE
ID_BB_PARENT_CO PARENT_COMP_NAME PARENT_COMP_TICKER CNTRY_OF_INCORPORATION BASIC_SPREAD INDUSTRY_SECTOR INDUSTRY_GROUP INDUSTRY_SUBGROUP COUNTRY_GUARANTOR CNTRY_OF_DOMICILE SECURITY_DES 144A_FLAG FLT_BENCH_MULTIPLIER SECURITY_TYP CV_START_DT GUARANTOR_NAME PREV_CPN_DT NXT_SINK_AMT CALL_DISCRETE PUT_DISCRETE MAKE_WHOLE_CALL ID_BB_UNIQUE LONG_COMP_NAME REDEMP_CRNCY CPN_CRNCY DTC_ELIGIBLE STRUCTURED_NOTE PCT_PAR_QUOTED PCS_QUOTE_TYP IS_UNIT_TRADED IS_REVERSE_CONVERTIBLE TRADE_CRNCY BEARER REGISTERED CALLED CALLED_DT ISSUER CALL_FEATURE PUT_FEATURE PENULTIMATE_CPN_DT FLT_CPN_CONVENTION CUR_CPN FLOATER TRADE_STATUS CDR_COUNTRY_CODE CDR_SETTLE_CODE SEASONING_STATUS FINAL_MATURITY PRVT_PLACE CALC_TYP REMOVAL_REASON IS_PERPETUAL IS_REG_S CALLED_PX DEFAULTED GILTS_EX_DVD_DT MOST_RECENT_REPORTED_FACTOR NXT_FACTOR_DT OID_BOND DELIVERY_TYP ID_SEDOL3 ID_SEDOL4 ID_SEDOL5 SEDOL1_COUNTRY_ISO SEDOL2_COUNTRY_ISO SEDOL3_COUNTRY_ISO SEDOL4_COUNTRY_ISO SEDOL5_COUNTRY_ISO ID_MIC1 ID_MIC2 ID_MIC3 ID_MIC4 ID_MIC5 CV_SH_PAR DUAL_CRNCY EXTENDIBLE EXCHANGEABLE IS_SOFT_CALL CV_MANDATORY_CNVS EU_SAVINGS_DIRECTIVE ID_CUSIP_REAL INDUSTRY_SUBGROUP_NUM SECURITY_TYP2 LAST_REFIX_DT ISO_COUNTRY_GUARANTOR DTC_REGISTERED CALL_PARTIAL CV_CNVS_PX IS_CURRENT_GOVT FIRST_CALL_DT_ISSUANCE UNDL_ID_BB_UNIQUE EST_CPN_FLAG ID_BB_GUARANTOR IS_DAY_PAYER STEPUP_CPN STEPUP_DT CALC_MATURITY ID_EXCH_SYMBOL CREDIT_ENHANCEMENTS INSURANCE_STATUS JUNIOR SENIOR FLT_PAY_DAY FLT_DAYS_PRIOR INFLATION_LINKED_INDICATOR DAYS_TO_SETTLE TYPE_OF_BOND REFERENCE_INDEX BASE_CPI CFI_CODE CPN_FREQ_YLD_CNV DAY_PAYER_FREQ EX_DIV_DAYS EX_DIV_CALENDAR CONTINGENT_CONVERSION CONTRIB_DATA_INDICATOR SECURITY_FACTORABLE ID_BB_GLOBAL END-OF-FIELDS
TIMESTARTED=Thu Jun 9 18:34:19 EDT 2011 230 START-OF-DATA 231 9999X01M9 Govt|-1|198|XIB|0|20110707| |WI TSY BILL|WI TSY BILL|USGN|Govt| |NONE|NORMAL|DISCOUNT|2|US GOVT|US|USD| |31782000000|31782000000|100|100| | | | 100.00000| | |20110106| |20110609|0.005000| | | | | | | | | | | | | | | | | | | | | |349057|13714872|9127952X8| | | | | | | | | | | | | | | | | | | | | | |N|N|N|218252|United States of America|3352Z|US| |Government|Sovereign|Sovereign| |US|XIB 07/07/11| | |US GOVERNMENT| | | | | | |N.A.|GV9999X01M9|United States Treasury Bill - WI Post Auction|USD|USD| | |Y|2|N| |USD|N.A.|N.A.|N| |WI TSY BILL| | | | | |N|Y|US|US| |20110707|N.A.|5| | | | | | | | |Y| | | | |N.A.|N.A.|N.A.|N.A.|N.A.| | | | | | | | |N.A.|N| |Non-Grandfathered|9999X01M9|459|Bill| | | | | |N| | | | |N| | |20110707| | | | | | | |N|1| | | | | | | | | | |N|BBG001CSH9Y7|
Since I have 4 primary keys, I though of doing the code using Hashes. Could someone please help me out. I am looking for a desperate help on this. Really appreciated if someone could help out with some ideas. |
|
| |
Gold Member
Posts : 11 Join date : 2011-05-19
| Subject: Re: Comparing two data files and writing to a diff file (Request help) Thu Jun 23, 2011 7:37 am | |
| |
|
| |
Servant Member
Posts : 4 Join date : 2011-05-20
| Subject: Re: Comparing two data files and writing to a diff file (Request help) Thu Jun 23, 2011 7:37 am | |
| Thanks a lot for your reply. The document is really helpful to start of with.
Could you please suggest me whether my approach is correct i.e. Since I have 4 primary keys, I would match according to these and if doesn't exists in file2 then the entire record is written in the Outfile.
If so,
I have a small issue here.
For Example:
The Outfile contains the difference of the records in the file1 and file2. i.e. If the record exists in file1 and doesnot exists in file2 and vice versa.
Here -- If the record exists in file1 and the same record exists 4 times in file2 then we dont want to loose any records.
Could you please suggest me how to overcome this issue. |
|
| |
Gold Member
Posts : 11 Join date : 2011-05-19
| Subject: Re: Comparing two data files and writing to a diff file (Request help) Thu Jun 23, 2011 7:39 am | |
| - Servant wrote:
- Thanks a lot for your reply. The document is really helpful to start of with.
Could you please suggest me whether my approach is correct i.e. Since I have 4 primary keys, I would match according to these and if doesn't exists in file2 then the entire record is written in the Outfile.
If so,
I have a small issue here.
For Example:
The Outfile contains the difference of the records in the file1 and file2. i.e. If the record exists in file1 and doesnot exists in file2 and vice versa.
Here -- If the record exists in file1 and the same record exists 4 times in file2 then we dont want to loose any records.
Could you please suggest me how to overcome this issue. I think we're suffering from a difference in terminology here. Any database table or record system I use contains a unique "primary" key that can be used to identify any particular record. I believe you are describing secondary keys. So my question is whether each combination of 4 secondary keys constitutes a unique record? If this is the case, than your problem is easy, and can be solved by the resource I showed above. If not, then you're going to have to determine rules on when you're going to exclude certain records. Either way, we'd need more information in order to advise you further. |
|
| |
Servant Member
Posts : 4 Join date : 2011-05-20
| Subject: Re: Comparing two data files and writing to a diff file (Request help) Thu Jun 23, 2011 7:40 am | |
| I think I did not communicate properly.
Below is the example that I want to overcome:
The primary key is the 1st column.
The data in the file1:
1|2|3|4|5|6 2|4|5|6|6|7 3|5|2|7|7|7
The data in file2:
4|6|6|8|7|8 2|7|5|3|6|2 5|3|6|4|7|3 2|4|5|7|7|8
Outputfile: 1|2|3|4|5|6 3|5|2|7|7|7 4|6|6|8|7|8 5|3|6|4|7|3 2|7|5|3|6|2 2|4|5|7|7|8
Since we are comparing the files depending upon the primary key (1st column), there exists 2 records in file2 .
Here we dont want to miss any records.
Inorder to overcome this do you think we have to compare the secondary key if the primary keys match ?
Could you plese suggest on this.
Thanks a lot for your reply. |
|
| |
Servant Member
Posts : 4 Join date : 2011-05-20
| Subject: Re: Comparing two data files and writing to a diff file (Request help) Thu Jun 23, 2011 7:41 am | |
| I have managed to write a perl script for the diff file. i.e. - Code:
-
#!/usr/local/bin/perl $self = $0; $self =~ s!^.*/!!; # $[ = 1; # = number of first index into arrays and strings # $FIELD_SEPARATOR = '\t'; $FIELD_NUMBER_LIST =('38','82');
$field_separator = $FIELD_SEPARATOR; $field_number_list = $FIELD_NUMBER_LIST; # while (@ARGV) { $_ = shift; if (/^-F$/) { $field_separator = shift; } elsif (/^-L$/) { $field_number_list = shift; } elsif (/^-F.+$/) { $field_separator = substr($_,$[+2); } elsif (/^-L.+$/) { $field_number_list = substr($_,$[+2); } #else { push(@filename, $_); } } # $file_a = 'file1'; $file_b = 'file2'; # unless (($file_a ne "") && (-f $file_a)) { die "Error: Can't find file '$file_a'!\n"; } unless (($file_b ne "") && (-f $file_b)) { die "Error: Can't find file '$file_b'!\n"; } # @index_list = split(/,/, $field_number_list); # # Scan first file, Pass 1: open(FILE_A, "<$file_a") || die "Can't open '$file_a': $!\n"; # while (<FILE_A>) { chop if /\n$/; undef $key; undef @field; @field = split(/$field_separator/o); foreach $index (@index_list) { if (defined $key) { $key .= "\n" . $field[$index]; } else { $key = $field[$index]; } } $intersection{$key} = 1;
} # close(FILE_A); # Scan second file, Pass 1: # $empty_intersection = 1; # open(FILE_B, "<$file_b") || die "Can't open '$file_b': $!\n"; # while (<FILE_B>) { chop if /\n$/; undef $key; undef @field; @field = split(/$field_separator/o); foreach $index (@index_list) { if (defined $key) { $key .= "\n" . $field[$index]; } else { $key = $field[$index]; } } $code = $intersection{$key};
if ($code == 1) { $intersection{$key} = 3; $empty_intersection = 0; } else { if ($code != 3) { $intersection{$key} = 2; } } } # close(FILE_B); # # Prepare output file names: $file_a_1 = $file_a . '.1'; #
# Scan first file, Pass 2: # open(FILE_A, "<$file_a") || die "Can't open '$file_a': $!\n"; open(FILE_A_1, ">$file_a_1") || die "Can't write '$file_a_1': $!\n"; # while (<FILE_A>) { chop if /\n$/; undef $key; undef @field; @field = split(/$field_separator/o); foreach $index (@index_list) { if (defined $key) { $key .= "\n" . $field[$index]; } else { $key = $field[$index]; } } if ($intersection{$key} == 3) { #
} else { print FILE_A_1 $_, "\n"; } } # close(FILE_A); close(FILE_A_1); # # Scan second file, Pass 2: # open(FILE_B, "<$file_b") || die "Can't open '$file_b': $!\n"; open(FILE_A_1, ">>$file_a_1") || die "Can't write '$file_a_1': $!\n"; # while (<FILE_B>) { chop if /\n$/; undef $key; undef @field; @field = split(/$field_separator/o); foreach $index (@index_list) { if (defined $key) { $key .= "\n" . $field[$index]; } else { $key = $field[$index]; } } if ($intersection{$key} == 3) { # } else { print FILE_A_1 $_, "\n"; } } # close(FILE_B); close(FILE_A_1); # # Display results: # printf("The Diff file created '%s'\n\n", $file_a_1); #
The above code works perfectly for generating the diff file i.e. depending upon the primary keys (here taken 2) the outfile contains the records that exists in file1 but not in file2 and the records that exists in file2 but not in file1. Now, I need to compare the whole record(line) if the primary keys in file1 matches with the primary keys in file2. If both the lines are equal then discard else write to the outfile. Could someone please help me out in order to the above step. Really appreciate your thoughts on this. |
|
| |
Gold Member
Posts : 11 Join date : 2011-05-19
| Subject: Re: Comparing two data files and writing to a diff file (Request help) Thu Jun 23, 2011 7:43 am | |
| Sorry mate, that really is too much code for me to analyze. I would have a few peripheral suggestions to make to you though: 1) Add use strict; and use warnings; to the beginning of this and every script that you create. It will require you to alter your coding style to always declare your variables and will of course require fixes to this script. However, it will save you so much effort in the end and will make it easier for others to look at your code in the future as well. 2) Instead of rolling your own parameter processing, I suggest that you implement your script using Getopt::LongThere are two reasons why this is useful. For one, it will save you effort in not having to reinvent this particular wheel now and in the future. However, it will also help you follow the command line standards for passing parameters to scripts and programs. Following those standards will make it much more likely that people will use your script as it's intended. Good luck, |
|
| |
Sponsored content
| Subject: Re: Comparing two data files and writing to a diff file (Request help) | |
| |
|
| |
| Comparing two data files and writing to a diff file (Request help) | |
|