Why calculate string type's duplication instance

Shawn
Posts: 5
Joined: Thu Nov 14, 2013 7:38 am

Why calculate string type's duplication instance

Post by Shawn » Thu Nov 14, 2013 7:46 am

Hi,
First of all, I love this tool very much and will buy it as soon as my manager approves the purchase (already in progress).

I have 2 questions:
1. As I understand it, all string instances are interned by default in .NET (I use .NET 3.5), so even if I create many strings with duplicate content, they actually reference one instance and no memory is truly wasted. So why does the profiler always report duplicated strings?

2. This question was posted on Stack Overflow a few days ago; please take a look:
http://stackoverflow.com/questions/1985 ... bject-heap

Thanks.

Andreas Suurkuusk
Posts: 1029
Joined: Wed Mar 02, 2005 7:53 pm

Re: Why calculate string type's duplication instance

Post by Andreas Suurkuusk » Thu Nov 14, 2013 10:14 am

1. Only literal strings in an assembly are interned by default. If you create a string by any other means, e.g. by using StringBuilder or concatenating strings, the resulting string will not be interned. You can intern the string manually by calling string.Intern, but that can have serious memory usage consequences. For more information, see the documentation for string.Intern.
So, unless you intern all your strings, you can (and will) have duplicate strings that use separate storage for the string data.

2. I added a reply to your StackOverflow post.
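To illustrate point 1, here is a minimal sketch of the difference between a literal and a string built at run time. The class and method names (InterningDemo, BuildAtRuntime) are made up for this example and do not come from the profiler or from the code discussed in this thread.

```csharp
using System;
using System.Text;

static class InterningDemo
{
    // Builds "TrxItem description." at run time, so the result is a new,
    // non-interned string object even though an equal literal exists.
    public static string BuildAtRuntime()
    {
        return new StringBuilder("TrxItem").Append(" description.").ToString();
    }

    static void Main()
    {
        string literalA = "TrxItem description.";
        string literalB = "TrxItem description.";
        string built = BuildAtRuntime();

        // Two identical literals in the same assembly share one instance.
        Console.WriteLine(object.ReferenceEquals(literalA, literalB)); // True

        // Same content, but a separate object on the heap.
        Console.WriteLine(built == literalA);                          // True
        Console.WriteLine(object.ReferenceEquals(built, literalA));    // False

        // string.Intern returns the pooled instance for this content,
        // so interning two separately built copies yields one reference.
        string internedOnce = string.Intern(built);
        string internedTwice = string.Intern(BuildAtRuntime());
        Console.WriteLine(object.ReferenceEquals(internedOnce, internedTwice)); // True
    }
}
```

The third line of output is the case the profiler reports as a duplicate: equal content stored in two different objects.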
Best regards,

Andreas Suurkuusk
SciTech Software AB

Post by Shawn » Fri Nov 15, 2013 2:03 am

Regarding question 1: say the profiler reports that a string has 10k duplicate instances. Does that 10k exclude the interned ones? If so, does it mean I really have 10k strings with the same content that were built by a StringBuilder?

Take this simplified code from my project. The profiler reports a large number of duplicates of the TransactionItem.Description string content. Could you explain?

Code: Select all

using System;
using System.Collections.Generic;

public class TransactionItem
{
    public string Description;
    public TransactionItem(string desc)
    {
        this.Description = desc;
    }
}

class Program
{
    static void Main(string[] args)
    {
        var simpleList = new List<TransactionItem>();
        for (int i = 0; i < 1000; i++)
        {
            // Every item receives a reference to the same literal string.
            simpleList.Add(new TransactionItem("TrxItem description."));
        }

        Console.WriteLine("object.ReferenceEquals(0, 1)   " + object.ReferenceEquals(simpleList[0].Description, simpleList[1].Description));
        // Always prints 'True'
    }
}

Post by Andreas Suurkuusk » Fri Nov 15, 2013 9:08 am

No, the profiler does not exclude the interned strings. If you have 10,000 duplicate strings and one of the strings is interned, it will still be presented as 10,000 duplicate strings and not 9,999. Note that there can only be one interned string in a set of duplicate strings.
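As a rough illustration of that counting rule (this is only a sketch, not how the profiler is implemented; the names ByReference and DuplicateCount are invented for the example), the following groups string objects by content and counts every distinct object, including the interned one:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Runtime.CompilerServices;

// Compares strings by object identity rather than by content.
sealed class ByReference : IEqualityComparer<string>
{
    public bool Equals(string x, string y) { return object.ReferenceEquals(x, y); }
    public int GetHashCode(string s) { return RuntimeHelpers.GetHashCode(s); }
}

static class DuplicateCount
{
    // For each content value, returns the number of distinct string objects holding it.
    public static Dictionary<string, int> CountInstances(IEnumerable<string> references)
    {
        return references
            .Distinct(new ByReference())   // one entry per object, not per reference
            .GroupBy(s => s)               // group objects with equal content
            .ToDictionary(g => g.Key, g => g.Count());
    }

    static void Main()
    {
        string interned = "TrxItem description.";

        // The same interned object referenced twice counts once.
        var refs = new List<string> { interned, interned };

        // Three separate copies with the same content.
        for (int i = 0; i < 3; i++)
        {
            refs.Add(new string(interned.ToCharArray()));
        }

        Console.WriteLine(CountInstances(refs)[interned]); // 4
    }
}
```

The result is 4, not 3: the interned string is simply one more object in the set of duplicates.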

Your example should not cause any duplicated strings, since you use the same string for each TransactionItem. I ran your code in the profiler (I added a Console.ReadLine at the end, to be able to collect a snapshot). As expected, there were no duplicated strings related to TransactionItem (only two small sets created by the framework).
[Screenshot NonDuplicateStrings.png: No duplicate strings]
The TransactionItems, on the other hand, are reported as duplicate instances, since they are separate instances with the same content.
[Screenshot DuplicateTransactions.png: Duplicate transactions]
I modified your code slightly to show how duplicate strings can be created, even though the string is interned.

Code: Select all

using System;
using System.Collections.Generic;

public class TransactionItem
{
    public string Description;
    public TransactionItem(string desc)
    {
        this.Description = desc;
    }
}

class Program
{
    static void Main(string[] args)
    {
        string internedItemDescription = "TrxItem description.";
        if (string.IsInterned(internedItemDescription) != null)
        {
            Console.WriteLine("Item description is interned");
        }

        var simpleList = new List<TransactionItem>();

        string itemName = "TrxItem";
        for (int i = 0; i < 1000; i++)
        {
            // Build a duplicate description instead of using the interned description.
            string itemDescription = itemName + " description.";
            simpleList.Add(new TransactionItem(itemDescription));
        }

        Console.WriteLine("object.ReferenceEquals(0, 1)   " + object.ReferenceEquals(simpleList[0].Description, simpleList[1].Description));

        // Wait for user interaction, to allow a snapshot to be collected.
        Console.ReadLine();

        // Make sure that the simpleList is not optimized away.
        GC.KeepAlive(simpleList);
    }
}
To make the result clearer, I compiled the code as Release and added the GC.KeepAlive call to prevent simpleList from being garbage collected before the snapshot is collected. (Under a Debug build, all variables are kept alive until the end of the method, which would cause additional root paths to be presented in the snapshot.)

The profiler will present 1,001 duplicate instances of the string "TrxItem description.": 1,000 instances created for the TransactionItems and one interned string.
[Screenshot DuplicateStrings.png: Duplicated strings]
The root paths of the duplicated strings show this:
[Screenshot StringsRootPath1.png: Root path for TransactionItem strings]
[Screenshot StringsRootPath2.png: Root path for interned string]
Best regards,

Andreas Suurkuusk
SciTech Software AB

Post by Shawn » Mon Nov 18, 2013 5:44 am

Thank you very much for trying to reproduce this, and sorry that I didn't run the sample code through the profiler before posting. After reading your investigation I reviewed the source code many times, and I'm pretty sure the sample matches our production code, which means there is no

Code: Select all

StringBuilder.Append("...")
and no ordinary string concatenation like

Code: Select all

str1 = str2 + "..."
as you showed.

It just passes a whole string into the constructor

Code: Select all

TransactionItem(string desc)
and inside the constructor it simply assigns the value to a public property.

Since the memory dump came from lab stress testing, I'm really confused about how to explain this:
[Screenshot duplicateStrings.jpg: duplicate strings]

Post by Andreas Suurkuusk » Mon Nov 18, 2013 1:31 pm

Somehow it seems like you do create duplicate instances of your string. As I have not seen your code I cannot explain how it happens, but I believe it's unlikely that the runtime somehow creates implicit copies of the same string or that the profiler finds non-existent copies of the string.

Are you actually using a literal string as the TransactionItem description? If you retrieve the string from some other source, e.g. a UI-element, then a new string might get created each time you retrieve it.

Have you tried to test this when running under the profiler and not just when importing a memory dump? If you run under the profiler you will be able to see the allocation call stacks of the strings. This should help you find out how the strings are created.
Best regards,

Andreas Suurkuusk
SciTech Software AB

Post by Shawn » Tue Nov 19, 2013 2:58 am

I think I finally found the answer, and it is in code I didn't paste: there is another TransactionItem constructor that takes a single XmlNode parameter, which contains the 'description' property value deserialized from a persisted XML file.

So the logic in my program is: call the normal TransactionItem constructor 1000 times, so that every 'description' references the same interned string, as we analysed in the preceding posts. The 1000 objects are then saved to files, and some time later (say, when the system is idle) a deserialization process runs 1000 times to rebuild all the objects with code like:

Code: Select all

// xd is assumed to be an XmlDocument.
var xd = new XmlDocument();
TransactionItem item = new TransactionItem("");
xd.Load(xmlFile.FullName);
var node = xd.DocumentElement.SelectSingleNode(@"Description");
// !!! No interning happens here! This is where the duplicate strings come in.
item.Description = node.InnerText;

So I'm a bit confused about the conditions for automatic interning. I previously thought that wherever we create a string (except via concatenation or StringBuilder), whether explicitly or implicitly (like here, loading file content into a property), the CLR would try to intern it, because of this quote from MSDN:
to each unique literal string declared or created programmatically in your program
So I reasonably suspected that the source code of the node.InnerText property must look something like:

Code: Select all

public string InnerText
{
    get
    {
        string text = unManagedAPICallToRetrieveFileAndThenResolveXmlNodeContent();
        return text;
    }
}

But why didn't that text get interned automatically?

BTW, could you explain more about this:
a UI-element, then a new string might get created each time you retrieve it

Thanks.

Post by Andreas Suurkuusk » Tue Nov 19, 2013 10:08 pm

The InnerText property of an XmlNode will never return an interned string. The text string in the node has been retrieved from a file (or some other stream), and will not automatically be interned by the framework or .NET runtime. Depending on the implementation, the InnerText property might return the same string each time, or a new string might be created for each get. But the string instance will not be the same for different nodes, even if the text is the same.
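A small way to check this yourself is sketched below. The XML layout (Items/Item/Description) and the names InnerTextDemo and ReadDescriptions are assumptions made for the example; they are not taken from the code discussed in this thread.

```csharp
using System;
using System.Xml;

static class InnerTextDemo
{
    // Returns the InnerText of every Description element in the given XML.
    public static string[] ReadDescriptions(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);

        XmlNodeList nodes = doc.SelectNodes("/Items/Item/Description");
        var result = new string[nodes.Count];
        for (int i = 0; i < nodes.Count; i++)
        {
            result[i] = nodes[i].InnerText;
        }
        return result;
    }

    static void Main()
    {
        string xml =
            "<Items>" +
            "<Item><Description>TrxItem description.</Description></Item>" +
            "<Item><Description>TrxItem description.</Description></Item>" +
            "</Items>";

        string[] descriptions = ReadDescriptions(xml);

        // The two nodes carry equal text.
        Console.WriteLine(descriptions[0] == descriptions[1]); // True

        // Shows whether the two nodes returned the same string object.
        Console.WriteLine(object.ReferenceEquals(descriptions[0], descriptions[1]));
    }
}
```

If the second line prints False, the two nodes hold separate string objects with the same content, which is what shows up as duplicate strings in a snapshot.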

The documentation for string.Intern is clearly a bit confusing. I'm not sure what they mean by a "programmatically" created literal string. A literal string is stored within the assembly and includes the strings within quotation marks in your program.

An interned string can never be garbage collected, so interning strings loaded from a file could cause serious memory usage problems. I strongly recommend that you don't intern strings loaded from an XML-file, unless you know that it is a common string that will be used over the life-time of the application.

Instead I recommend that you take another approach to avoid duplicate instances.

After we wrote the duplicate instances detector we of course tested it on the profiler itself, and we found a lot of duplicate instances. To get rid of them we wrote two container classes: SingleValueContainer and WeakSingleValueContainer. I have attached a zip-file that includes slightly modified versions of these containers. You may use them in your project if you wish. Just note that I have modified them from our original code and the modifications have not been thoroughly tested yet. Use them at your own risk :)

The SingleValueContainer is suitable to use when you have a clearly defined "region" where you want to avoid duplicate instances. This can for instance be when you open an XML-file or other document that includes a lot of duplicate strings or other duplicate instances.

Code: Select all

private void LoadTransactionItems()
{
    SingleValueContainer<string> commonStrings = new SingleValueContainer<string>();

    // ...
    XmlNode xmlNode = ...;
    list.Add( new TransactionItem(commonStrings[xmlNode.InnerText]) );
    // ....
}
The SingleValueContainer indexer provides a single copy of the provided element (based on IEqualityComparer) similar to the string.Intern method (but the container handles any class, not just strings). However, as soon as the container is cleared or GCed, the elements in the container are also eligible for collection.
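For readers who do not have the attachment, the idea can be sketched roughly as follows. This is a hypothetical stand-in written for illustration, not the code from the zip-file; the name ValuePool and its members are assumptions.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in; NOT the SingleValueContainer from the attachment.
sealed class ValuePool<T> where T : class
{
    private readonly Dictionary<T, T> items;

    public ValuePool() : this(EqualityComparer<T>.Default) { }

    public ValuePool(IEqualityComparer<T> comparer)
    {
        items = new Dictionary<T, T>(comparer);
    }

    // Returns the first instance seen that equals 'value'.
    // If there is none yet, 'value' itself is stored and returned.
    public T this[T value]
    {
        get
        {
            if (value == null) return null;

            T existing;
            if (items.TryGetValue(value, out existing)) return existing;

            items.Add(value, value);
            return value;
        }
    }

    public int Count { get { return items.Count; } }

    // After Clear (or once the pool itself is unreachable) the pooled
    // instances are only kept alive by whoever still references them.
    public void Clear() { items.Clear(); }
}

static class ValuePoolDemo
{
    static void Main()
    {
        var pool = new ValuePool<string>();

        // Two separate string objects with equal content.
        string a = pool[new string("TrxItem description.".ToCharArray())];
        string b = pool[new string("TrxItem description.".ToCharArray())];

        Console.WriteLine(object.ReferenceEquals(a, b)); // True
        Console.WriteLine(pool.Count);                   // 1
    }
}
```

The second copy becomes garbage immediately, because the pool hands back the first instance instead of storing the new one.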

The WeakSingleValueContainer is similar to the SingleValueContainer, but it only keeps a weak reference to the elements, so there's no need to explicitly clear the container. However, the overhead is significantly higher for the WeakSingleValueContainer, since a weak GC handle is created for each unique item. You can use the WeakSingleValueContainer when the "region" is not as clearly defined and/or when you expect to have many duplicates and only a few unique instances.
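The weak variant can be sketched in the same spirit. Again, this is a hypothetical illustration and not the code from the attachment; the name WeakValuePool and the bucket layout are assumptions, and the sketch never removes empty buckets.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in; NOT the WeakSingleValueContainer from the attachment.
sealed class WeakValuePool<T> where T : class
{
    // Buckets keyed by hash code; each entry references its value only weakly,
    // so the pool does not keep values alive on its own.
    private readonly Dictionary<int, List<WeakReference>> buckets =
        new Dictionary<int, List<WeakReference>>();

    private readonly IEqualityComparer<T> comparer = EqualityComparer<T>.Default;

    public T this[T value]
    {
        get
        {
            if (value == null) return null;

            int hash = comparer.GetHashCode(value);
            List<WeakReference> bucket;
            if (!buckets.TryGetValue(hash, out bucket))
            {
                bucket = new List<WeakReference>();
                buckets.Add(hash, bucket);
            }

            // Drop entries whose targets have already been collected.
            bucket.RemoveAll(w => !w.IsAlive);

            foreach (WeakReference w in bucket)
            {
                // Target is null if the value was collected in the meantime.
                T candidate = w.Target as T;
                if (candidate != null && comparer.Equals(candidate, value))
                {
                    return candidate;
                }
            }

            bucket.Add(new WeakReference(value));
            return value;
        }
    }
}

static class WeakValuePoolDemo
{
    static void Main()
    {
        var pool = new WeakValuePool<string>();

        string a = pool[new string("TrxItem description.".ToCharArray())];
        string b = pool[new string("TrxItem description.".ToCharArray())];

        // 'a' is still referenced here, so the pool returns it for the second copy.
        Console.WriteLine(object.ReferenceEquals(a, b)); // True
    }
}
```

One WeakReference object is allocated per unique value, which is the extra overhead mentioned above.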

I hope this helps. Reply to this post if you need any additional information or help on how to use the containers.
Attachment: SingleValueContainer.zip (5.31 KiB)
Best regards,

Andreas Suurkuusk
SciTech Software AB
