[Tshwanelex-l] importing extra PCDATA not in original xml import file

Sargon Hasso dshasso at gmail.com
Wed Nov 7 20:08:31 EST 2012


Yep, that did it. I noticed that all the whitespace (except line feeds) consisted of space characters; I had used xmlpad to construct the sample XML file. In Notepad++ I could see the space characters, and I converted them to tabs, since TLex uses tabs for indentation when it exports. After that I was able to import cleanly.
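
For a larger file, the same space-to-tab conversion could probably be scripted instead of done by hand in an editor. Below is a minimal sketch in Python, not anything taken from TLex itself: the function name leading_spaces_to_tabs and the spaces_per_level value of 2 are assumptions for illustration, and the sample element content is made up. It only touches the run of spaces at the start of each line, so spaces inside element text on the same line as its tags are left alone; text content that spans several lines would need more care.

```python
import re


def leading_spaces_to_tabs(text, spaces_per_level=2):
    """Replace each line's leading run of spaces with tab characters.

    spaces_per_level is an assumed indentation width; adjust it to
    whatever the editor that produced the file actually used.
    """
    def convert(match):
        width = len(match.group(0))
        # Ceiling division, so a partial indentation step still becomes a tab.
        return "\t" * -(-width // spaces_per_level)

    # ^ with re.MULTILINE anchors at the start of every line.
    return re.sub(r"^ +", convert, text, flags=re.MULTILINE)


# Hypothetical sample modelled on the <Sense>/<Definition> nesting in this thread.
sample = "<Sense>\n  <Definition>text here</Definition>\n</Sense>\n"
converted = leading_spaces_to_tabs(sample)
print(repr(converted))
```

To apply it to a real import file, one would read the file as UTF-8 text, pass the contents through the function, and write the result to a new file, keeping the original as a backup.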

Sargon

On Nov 7, 2012, at 11:27 AM, "David Joffe" <david.joffe at tshwanedje.com> wrote:

> Hi Sargon,
> 
> Hmm, it seems as though the XML importer is picking up the 
> 'whitespace' - i.e. the 'indentation' characters - as content, e.g. 
> if you have:
> 
> <Sense>
>    <Definition>
> 
> then it is incorrectly picking up the spacing character(s) in front 
> of the <Definition> as content, instead of ignoring them. I'm not 
> sure why, it shouldn't be doing this (we'll have a look at why it's 
> happening), but a temporary workaround, if possible, is to modify 
> the XML to be imported to remove the extra spacing, e.g. have it all 
> on one line per entry:
> 
> <Sense><Definition>...
> 
> - David
> 
> 
> On 4 Nov 2012 at 11:30, Sargon Hasso wrote:
> 
> From:    "Sargon Hasso" <dshasso at gmail.com>
> To:    <tshwanelex-l at mailman.yale.edu>
> Date sent:    Sun, 4 Nov 2012 11:30:29 -0600
> Subject:    Re: [Tshwanelex-l] importing extra PCDATA not in original xml
>    import file
> 
>>    I must have missed attaching the xml file.
>>    From: Sargon Hasso [mailto:dshasso at gmail.com]
>>    Sent: Sunday, November 04, 2012 11:13 AM
>>    To: 'tshwanelex-l at mailman.yale.edu'
>>    Subject: importing extra PCDATA not in original xml import file
>>    I am importing lemma entries from an xml file, following the instructions in the Tlex manual; 
>>    however, after each sense I am seeing an extra item, marked up as PCDATA, that shows up 
>>    as a blank entry, e.g. ' '.
>> 
>> 
>>    I am enclosing my xml file for reference. How do I get rid of this extra entry?
>>    This xml file is just experimental; I am planning to import more than 6000 entries, so it is not 
>>    just a few entries that I could clean up manually.
>>    This is an English-Syriac-Arabic dictionary. Syriac, like Arabic, is an RTL script.
>>    Regards,
>>    Sargon
> 
> 

